Re: [patch] PR tree-optimization/96818 - cast label range to type of switch operand

2020-09-03 Thread Aldy Hernandez via Gcc-patches




On 8/31/20 2:55 PM, Richard Biener wrote:

On Mon, Aug 31, 2020 at 11:10 AM Aldy Hernandez  wrote:




On 8/31/20 10:33 AM, Richard Biener wrote:

On Mon, Aug 31, 2020 at 10:20 AM Aldy Hernandez  wrote:


As discussed in the PR, the type of a switch operand may be different
than the type of the individual cases.  This is causing us to attempt to
intersect incompatible ranges.

This inconsistent IL seems wrong to me, but it also seems pervasive
throughout GCC.  Jason, as one of the original gimple designers, do you
remember if this was an oversight, or was there a reason for this?

Richard, you mention in the PR that normalizing the IL would make
propagation of widening cast sources into switches harder.  Couldn't we
just cast all labels to the switch operand type upon creation?  What
would be the problem with that?  Would we generate sub-optimal code?
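
For illustration, a reduced sketch of the kind of mismatch being discussed
(my own example, not the testcase from the PR): the switch operand is a
widening cast, and if the cast source is propagated into the GIMPLE switch
the operand becomes an unsigned char while CASE_LOW/CASE_HIGH keep their
int type:

int
f (unsigned char c)
{
  switch ((int) c)
    {
    case 1:
      return 10;
    case 250:
      return 20;
    default:
      return 0;
    }
}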

In either case, I'd hate to leave trunk in a broken state while this is
being discussed.  The attached patch fixes the PRs.

OK?


 widest_irange label_range (CASE_LOW (min_label), case_high);
+  if (!types_compatible_p (label_range.type (), range_of_op->type ()))
+   range_cast (label_range, range_of_op->type ());
 label_range.intersect (range_of_op);

so label_range is of 'widest_int' type but then instead of casting
range_of_op to widest_irange you cast that widest_irange to the
type of the range op?  Why not build label_range in the type
of range_op immediately?


The "widest" in widest_irange does not denote the type of the range, but
the number of sub-ranges that can maximally fit.  It has nothing to do
with the underlying type of its elements.  Instead, it is supposed to
indicate that label_range can fit an infinite number of sub-ranges.  In
practice this is 255 sub-ranges, but we can adjust that if needed.
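
For instance (illustrative declarations only, using the existing typedefs):

  irange3 r3;                 // int_range<3>: at most 3 sub-ranges
  widest_irange label_range;  // int_range<255>: effectively unlimited sub-ranges

i.e. the template parameter bounds the number of sub-ranges, not the
precision or type of the bounds.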


That's definitely confusing naming then :/


I've been mulling this over, and you're right.  The naming does imply 
some relation to widest_int.


Originally I batted some ideas on naming this, but came up short.  I'm 
still not sure what it should be.


max_irange?

biggest_irange?

Perhaps some name involving "int_range", since int_range<> is how ranges 
are declared (int_range_max??).


Suggestions welcome.
Aldy



Re: [PATCH] [AVX512] [PR87767] Optimize memory broadcast for constant vector under AVX512

2020-09-03 Thread Jakub Jelinek via Gcc-patches
On Thu, Sep 03, 2020 at 10:11:14AM +0800, Hongtao Liu wrote:
> Thanks for the review, update patch.

Ok for trunk, thanks.

> From acf3825279190ca0540bb4704f66568fdbe06ce8 Mon Sep 17 00:00:00 2001
> From: liuhongt 
> Date: Wed, 8 Jul 2020 17:14:36 +0800
> Subject: [PATCH] Optimize memory broadcast for constant vector under AVX512.
> 
> For a constant vector having one duplicated value, there's no need to put
> the whole vector in the constant pool; use an embedded broadcast instead.
> 
> 2020-07-09  Hongtao Liu  
> 
> gcc/ChangeLog:
> 
>   PR target/87767
>   * config/i386/i386-features.c
>   (replace_constant_pool_with_broadcast): New function.
>   (constant_pool_broadcast): Ditto.
>   (class pass_constant_pool_broadcast): New pass.
>   (make_pass_constant_pool_broadcast): Ditto.
>   (remove_partial_avx_dependency): Call
>   replace_constant_pool_with_broadcast under TARGET_AVX512F, it
>   would save compile time when both pass rpad and cpb are
>   available.
>   (remove_partial_avx_dependency_gate): New function.
>   (class pass_remove_partial_avx_dependency::gate): Call
>   remove_partial_avx_dependency_gate.
>   * config/i386/i386-passes.def: Insert new pass after combine.
>   * config/i386/i386-protos.h
>   (make_pass_constant_pool_broadcast): Declare.
>   * config/i386/sse.md (*avx512dq_mul3_bcst):
>   New define_insn.
>   (*avx512f_mul3_bcst): Ditto.
>   * config/i386/avx512fintrin.h (_mm512_set1_ps,
>   _mm512_set1_pd,_mm512_set1_epi32, _mm512_set1_epi64): Adjusted.
> 
> gcc/testsuite/ChangeLog:
> 
>   PR target/87767
>   * gcc.target/i386/avx2-broadcast-pr87767-1.c: New test.
>   * gcc.target/i386/avx512f-broadcast-pr87767-1.c: New test.
>   * gcc.target/i386/avx512f-broadcast-pr87767-2.c: New test.
>   * gcc.target/i386/avx512f-broadcast-pr87767-3.c: New test.
>   * gcc.target/i386/avx512f-broadcast-pr87767-4.c: New test.
>   * gcc.target/i386/avx512f-broadcast-pr87767-5.c: New test.
>   * gcc.target/i386/avx512f-broadcast-pr87767-6.c: New test.
>   * gcc.target/i386/avx512f-broadcast-pr87767-7.c: New test.
>   * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: New test.
>   * gcc.target/i386/avx512vl-broadcast-pr87767-2.c: New test.
>   * gcc.target/i386/avx512vl-broadcast-pr87767-3.c: New test.
>   * gcc.target/i386/avx512vl-broadcast-pr87767-4.c: New test.
>   * gcc.target/i386/avx512vl-broadcast-pr87767-5.c: New test.
>   * gcc.target/i386/avx512vl-broadcast-pr87767-6.c: New test.
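
To illustrate the transformation the commit message above describes, a
minimal example; the intrinsic usage here is an assumption on my part, not
taken from the patch or its testcases:

#include <immintrin.h>

__m512
add_half (__m512 x)
{
  /* The vector constant {0.5f, 0.5f, ...} has a single duplicated value, so
     instead of a 64-byte constant-pool entry it can be kept as a 4-byte
     scalar that is broadcast (e.g. via vbroadcastss or an embedded
     broadcast operand).  */
  return _mm512_add_ps (x, _mm512_set1_ps (0.5f));
}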

Jakub



Re: [patch] PR tree-optimization/96818 - cast label range to type of switch operand

2020-09-03 Thread Richard Biener via Gcc-patches
On Thu, Sep 3, 2020 at 9:21 AM Aldy Hernandez  wrote:
>
>
>
> On 8/31/20 2:55 PM, Richard Biener wrote:
> > On Mon, Aug 31, 2020 at 11:10 AM Aldy Hernandez  wrote:
> >>
> >>
> >>
> >> On 8/31/20 10:33 AM, Richard Biener wrote:
> >>> On Mon, Aug 31, 2020 at 10:20 AM Aldy Hernandez  wrote:
> 
>  As discussed in the PR, the type of a switch operand may be different
>  than the type of the individual cases.  This is causing us to attempt to
>  intersect incompatible ranges.
> 
>  This inconsistent IL seems wrong to me, but it also seems pervasive
>  throughout GCC.  Jason, as one of the original gimple designers, do you
>  remember if this was an oversight, or was there a reason for this?
> 
>  Richard, you mention in the PR that normalizing the IL would make
>  propagation of widening cast sources into switches harder.  Couldn't we
>  just cast all labels to the switch operand type upon creation?  What
>  would be the problem with that?  Would we generate sub-optimal code?
> 
>  In either case, I'd hate to leave trunk in a broken state while this is
>  being discussed.  The attached patch fixes the PRs.
> 
>  OK?
> >>>
> >>>  widest_irange label_range (CASE_LOW (min_label), case_high);
> >>> +  if (!types_compatible_p (label_range.type (), range_of_op->type 
> >>> ()))
> >>> +   range_cast (label_range, range_of_op->type ());
> >>>  label_range.intersect (range_of_op);
> >>>
> >>> so label_range is of 'widest_int' type but then instead of casting
> >>> range_of_op to widest_irange you cast that widest_irange to the
> >>> type of the range op?  Why not build label_range in the type
> >>> of range_op immediately?
> >>
> >> The "widest" in widest_irange does not denote the type of the range, but
> >> the number of sub-ranges that can maximally fit.  It has nothing to do
> >> with the underlying type of its elements.  Instead, it is supposed to
> >> indicate that label_range can fit an infinite number of sub-ranges.  In
> >> practice this is 255 sub-ranges, but we can adjust that if needed.
> >
> > That's definitely confusing naming then :/
>
> I've been mulling this over, and you're right.  The naming does imply
> some relation to widest_int.
>
> Originally I batted some ideas on naming this, but came up short.  I'm
> still not sure what it should be.
>
> max_irange?
>
> biggest_irange?
>
> Perhaps some name involving "int_range", since int_range<> is how ranges
> are declared (int_range_max??).

The issue is that the name widest_irange suggests a connection to the type of
the range, while it actually constrains the number of subranges.  So if there's
an irange<1>, irange<2>, etc. then irange_max would be a typedef that
makes the most sense (both <2> and _max are then suffixes).

Hmm, a simple grep reveals

value-range.h:typedef int_range<255> widest_irange;
value-range.h:class GTY((user)) int_range : public irange

so widest_irange is not an irange but an int_range?  But then
there's little use of 'irange' or 'int_range', most code uses
either irange & arguments (possibly accepting sth else than
int_range) and variables are widest_irange or irange3
(typedefed from int_range<3>).

IMHO a typedef irange* to int_range is a bit confusing.

Why isn't it int_range3 or widest_int_range?  Why is there
irange and int_range?

Well, I still stand by irange_max, or, preferably, int_range_max.
Or ...

template
class GTY((user)) int_range : public irange
{

so people can use

int_range x;

and get the max range by default?

But back to the question of just renaming widest_irange.  Use irange_max.
The other confusion still stands of course.

Richard.

> Suggestions welcome.
> Aldy
>


[committed] d: Only test with default permutation flags for runnable tests.

2020-09-03 Thread Iain Buclaw via Gcc-patches
Hi,

This patch alters the way the D2 testsuite is run.

- All compilable tests now default to being compiled, rather than
  assembled.  A new directive, "LINK", has been added to set the test
  action to link.

- Unless the test explicitly requests otherwise, all compilable tests as well as
  fail_compilation tests will be run without any extra flags.

- The C++ tests are now checked against the shared D runtime library.

Tested on x86_64-linux-gnu, and committed to mainline.

Regards
Iain.

---
gcc/testsuite/ChangeLog:

* lib/gdc-utils.exp (gdc-convert-test): Handle LINK directive.
Set PERMUTE_ARGS as DEFAULT_DFLAGS only for runnable tests.
(gdc-do-test): Set default action of compilable tests to compile.
Test SHARED_OPTION on runnable_cxx tests.
---
 gcc/testsuite/lib/gdc-utils.exp | 29 +
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/gcc/testsuite/lib/gdc-utils.exp b/gcc/testsuite/lib/gdc-utils.exp
index 37c1620f307..b1f90b8f92e 100644
--- a/gcc/testsuite/lib/gdc-utils.exp
+++ b/gcc/testsuite/lib/gdc-utils.exp
@@ -192,6 +192,7 @@ proc gdc-copy-extra { base extra } {
 #   PERMUTE_ARGS:  The set of arguments to permute in multiple compiler
 #  invocations.  An empty set means only one permutation
 #  with no arguments.
+#   LINK:  Enables linking.
 #   TEST_OUTPUT:   The output expected from the compilation.
 #   POST_SCRIPT:   Not handled.
 #   REQUIRED_ARGS: Arguments to add to the compiler command line.
@@ -203,12 +204,13 @@ proc gdc-convert-test { base test } {
 global PERMUTE_ARGS
 global GDC_EXECUTE_ARGS
 
-set PERMUTE_ARGS $DEFAULT_DFLAGS
+set PERMUTE_ARGS ""
 set GDC_EXECUTE_ARGS ""
 
 set extra_sources ""
 set extra_files ""
 set needs_phobos 0
+set saw_test_flags 0
 
 upvar 1 compilable_do_what compilable_do_what
 set compilable_output_file_ext ""
@@ -237,6 +239,10 @@ proc gdc-convert-test { base test } {
# DISABLED is not handled.
regsub -- {DISABLED.*$} $copy_line "" out_line
 
+   } elseif [regexp -- {LINK:} $copy_line] {
+   # LINK sets dg-do-what-default "link"
+   set compilable_do_what "link"
+
} elseif [regexp -- {POST_SCRIPT} $copy_line] {
# POST_SCRIPT is not handled
regsub -- {POST_SCRIPT.*$} $copy_line "" out_line
@@ -246,14 +252,13 @@ proc gdc-convert-test { base test } {
if { $args != "" } {
error "gdc-convert-test: DFLAGS is not empty as expected"
}
-   if { $PERMUTE_ARGS == $DEFAULT_DFLAGS } {
-   set PERMUTE_ARGS ""
-   }
+   set saw_test_flags 1
regsub -- {DFLAGS.*$} $copy_line "" out_line
 
} elseif [regexp -- {PERMUTE_ARGS\s*:\s*(.*)} $copy_line match args] {
# PERMUTE_ARGS is handled by gdc-do-test.
set PERMUTE_ARGS [gdc-convert-args $args]
+   set saw_test_flags 1
regsub -- {PERMUTE_ARGS.*$} $copy_line "" out_line
 
} elseif [regexp -- {EXECUTE_ARGS\s*:\s*(.*)} $copy_line match args] {
@@ -339,6 +344,7 @@ proc gdc-convert-test { base test } {
 # Fail compilable are successful if an output is not generated.
 # Runnable must compile, link, and return 0 to be successful by default.
 switch $type {
+   runnable_cxx -
runnable {
if ![isnative] {
puts $fdout "// { dg-final { output-exists } }"
@@ -346,6 +352,11 @@ proc gdc-convert-test { base test } {
if $needs_phobos {
puts $fdout "// { dg-skip-if \"imports phobos\" { ! 
d_runtime_has_std_library } }"
}
+   # Run runnable tests with all default permutations if not
+   # explicitly set in the test file.
+   if !$saw_test_flags {
+   set PERMUTE_ARGS $DEFAULT_DFLAGS
+   }
}
 
compilable {
@@ -353,7 +364,6 @@ proc gdc-convert-test { base test } {
 
# Compilable test may require checking another kind of output file.
if { $compilable_output_file_ext != "" } {
-   set compilable_do_what "compile"
# Check that file generation tests output the expected file.
set genfile "[file rootname $name].$compilable_output_file_ext"
puts $fdout "// { dg-final { if \[file exists $genfile\] \\{
   } }"
@@ -370,11 +380,6 @@ proc gdc-convert-test { base test } {
}
 
fail_compilation {
-   # Fail compilation tests only check for language errors from the
-   # front-end.  No need to run all permutations of the default DFLAGS.
-   if { $PERMUTE_ARGS == $DEFAULT_DFLAGS } {
-   set PERMUTE_ARGS ""
-   }
puts $fdout "// { dg-final { output-exists-not } }"
}
 }
@@ -451,11 +456,11 @@ proc gdc-do-test { testcases } {
# Convert to DG test.
 

[committed] d: Move all runnable tests in gdc.dg to gdc.dg/torture

2020-09-03 Thread Iain Buclaw via Gcc-patches
Hi,

This patch creates a new test subdirectory gdc.dg/torture, moving all
runnable tests into it.

Tests that are not executed do not need to be compiled as torture tests;
they are only present to test for a certain bug or ICE.

Tested on x86_64-linux-gnu, and committed to mainline.

Regards
Iain.

---
gcc/testsuite/ChangeLog:

* gdc.dg/dg.exp: Remove torture options.
* gdc.dg/gdc115.d: Move test to gdc.dg/torture.
* gdc.dg/gdc131.d: Likewise.
* gdc.dg/gdc141.d: Likewise.
* gdc.dg/gdc17.d: Likewise.
* gdc.dg/gdc171.d: Likewise.
* gdc.dg/gdc179.d: Likewise.
* gdc.dg/gdc186.d: Likewise.
* gdc.dg/gdc187.d: Likewise.
* gdc.dg/gdc191.d: Likewise.
* gdc.dg/gdc198.d: Likewise.
* gdc.dg/gdc200.d: Likewise.
* gdc.dg/gdc210.d: Likewise.
* gdc.dg/gdc240.d: Likewise.
* gdc.dg/gdc242b.d: Likewise.
* gdc.dg/gdc248.d: Likewise.
* gdc.dg/gdc250.d: Likewise.
* gdc.dg/gdc273.d: Likewise.
* gdc.dg/gdc283.d: Likewise.
* gdc.dg/gdc285.d: Likewise.
* gdc.dg/gdc286.d: Likewise.
* gdc.dg/gdc309.d: Likewise.
* gdc.dg/gdc35.d: Likewise.
* gdc.dg/gdc36.d: Likewise.
* gdc.dg/gdc51.d: Likewise.
* gdc.dg/gdc57.d: Likewise.
* gdc.dg/gdc66.d: Likewise.
* gdc.dg/imports/gdc36.d: Likewise.
* gdc.dg/init1.d: Likewise.
* gdc.dg/pr92309.d: Likewise.
* gdc.dg/pr94424.d: Likewise.
* gdc.dg/pr94777b.d: Likewise.
* gdc.dg/pr96152.d: Likewise.
* gdc.dg/pr96153.d: Likewise.
* gdc.dg/pr96156.d: Likewise.
* gdc.dg/pr96157a.d: Likewise.
* gdc.dg/torture/torture.exp: New file.
---
 gcc/testsuite/gdc.dg/dg.exp   | 24 ++---
 gcc/testsuite/gdc.dg/{ => torture}/gdc115.d   |  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/gdc131.d   |  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/gdc141.d   |  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/gdc17.d|  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/gdc171.d   |  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/gdc179.d   |  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/gdc186.d   |  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/gdc187.d   |  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/gdc191.d   |  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/gdc198.d   |  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/gdc200.d   |  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/gdc210.d   |  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/gdc240.d   |  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/gdc242b.d  |  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/gdc248.d   |  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/gdc250.d   |  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/gdc273.d   |  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/gdc283.d   |  3 +-
 gcc/testsuite/gdc.dg/{ => torture}/gdc285.d   |  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/gdc286.d   |  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/gdc309.d   |  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/gdc35.d|  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/gdc36.d|  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/gdc51.d|  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/gdc57.d|  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/gdc66.d|  2 +-
 .../gdc.dg/{ => torture}/imports/gdc36.d  |  0
 gcc/testsuite/gdc.dg/{ => torture}/init1.d|  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/pr92309.d  |  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/pr94424.d  |  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/pr94777b.d |  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/pr96152.d  |  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/pr96153.d  |  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/pr96156.d  |  2 +-
 gcc/testsuite/gdc.dg/{ => torture}/pr96157a.d |  2 +-
 gcc/testsuite/gdc.dg/torture/torture.exp  | 50 +++
 37 files changed, 90 insertions(+), 53 deletions(-)
 rename gcc/testsuite/gdc.dg/{ => torture}/gdc115.d (89%)
 rename gcc/testsuite/gdc.dg/{ => torture}/gdc131.d (89%)
 rename gcc/testsuite/gdc.dg/{ => torture}/gdc141.d (89%)
 rename gcc/testsuite/gdc.dg/{ => torture}/gdc17.d (95%)
 rename gcc/testsuite/gdc.dg/{ => torture}/gdc171.d (95%)
 rename gcc/testsuite/gdc.dg/{ => torture}/gdc179.d (93%)
 rename gcc/testsuite/gdc.dg/{ => torture}/gdc186.d (97%)
 rename gcc/testsuite/gdc.dg/{ => torture}/gdc187.d (94%)
 rename gcc/testsuite/gdc.dg/{ => torture}/gdc191.d (99%)
 rename gcc/testsuite/gdc.dg/{ => torture}/gdc198.d (97%)
 rename gcc/testsuite/gdc.dg/{ => torture}/gdc200.d (90%)
 rename gcc/testsuite/gdc.dg/{ => torture}/gdc210.d (97%)
 rename gcc/testsuite/gdc.dg/{ => torture}/gdc240.d (89%)
 rename gcc/testsuite/gdc.dg/{ => torture}/gdc242b.d (91%)
 rename gcc/testsuite/gdc.dg/{ => torture}/gdc248.d (94%)
 rename gcc/testsuite/gdc.dg/{ => torture}/gdc250.d (90%)
 rename gcc/testsuite/gdc.dg/{ => torture}/gdc273.d (91%)
 rename gcc/testsuite/gdc.dg/{ => torture}/gdc283.d (96%)
 rename gcc/testsuite/gdc.dg/{ => torture}/gdc285.d (91%)
 rename gcc/testsuite/gdc.dg

[PATCH] lto: Cache location_ts including BLOCKs in GIMPLE streaming [PR94311]

2020-09-03 Thread Jakub Jelinek via Gcc-patches
Hi!

As mentioned in the PR, when compiling valgrind, even on a fairly small
testcase where in one larger function the location keeps oscillating
between a small line number and an 8000-ish line number in the same file,
we very quickly run out of all possible location_t numbers and because of
that emit nonsensical line numbers in .debug_line.
There are ways to decrease the speed of depleting location_t numbers
in libcpp, but the main reason for this is that we use
stream_input_location_now for streaming in location_t for gimple_location
and phi arg locations.  libcpp strongly prefers that the locations
it is given are sorted by the different files and by line numbers in
ascending order, otherwise it depletes quickly no matter what and is much
more costly (many extra file changes etc.).
The reason for not caching those was the BLOCKs that were streamed
immediately after the location and encoded into the locations (and for PHIs
we failed to stream the BLOCKs altogether).
This patch enhances the location cache to also handle BLOCKs (not for
everything, only for the spots where we care about the BLOCKs) and also optimizes
the size of the LTO stream by emitting a single bit into a pack to say whether the
BLOCK changed from the last one, and only streaming the BLOCK tree if it
changed.
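
A rough sketch of the idea (not the actual patch; the exact streamer entry
points and the current_block member are assumptions based on the ChangeLog
below):

static void
output_location_and_block (struct output_block *ob, struct bitpack_d *bp,
			   location_t loc, tree block)
{
  bool block_changed = (ob->current_block != block);
  /* One bit in the pack: did the BLOCK change since the last location?  */
  bp_pack_value (bp, block_changed, 1);
  /* ... file/line/column are packed and streamed as before ...  */
  if (block_changed)
    /* Stream the BLOCK tree only when it actually changed.  */
    stream_write_tree (ob, block, true);
  ob->current_block = block;
}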

Bootstrapped/regtested on x86_64-linux with
../configure --enable-languages=default,ada,obj-c++,lto,go,brig,d 
--with-build-config=bootstrap-lto --enable-checking=yes,rtl,extra && make -j16 
bootstrap > LOG 2>&1 && GXX_TESTSUITE_STDS=98,11,14,17,2a make -j32 -k check > 
LOGC 2>&1; ../contrib/test_summary > LOGT 2>&1
as well as normally bootstrapped/regtested on x86_64-linux and i686-linux,
ok for trunk?

BTW, I wonder about one preexisting issue (but that can be done
incrementally).  On the streamer out side, we call clear_line_info
in multiple spots which resets the current_* values to something, but on the
reader side, we don't have corresponding resets in the same location, just have
the stream_* static variables that keep the current values through the
entire stream in (so across all the clear_line_info spots in a single LTO
object but also across jumping from one LTO object to another one).
Now, in an earlier version of my patch it actually broke LTO bootstrap
(and a lot of LTO testcases), so for the BLOCK case I've solved it by
clear_line_info setting current_block to something that should never appear,
which means that in the LTO stream after the clear_line_info spots including
the start of the LTO stream we force the block change bit to be set and thus
BLOCK to be streamed and therefore stream_block from earlier to be
ignored.  But for the rest I think that is not the case, so I wonder if we
don't sometimes end up with wrong line/column info because of that, or
please tell me what prevents that.
clear_line_info does:
  ob->current_file = NULL;
  ob->current_line = 0;
  ob->current_col = 0;
  ob->current_sysp = false;
while I think NULL current_file is something that should likely be different
from expanded_location (...).file (UNKNOWN_LOCATION/BUILTINS_LOCATION are
handled separately and not go through the caching), I think line number 0
can sometimes occur and especially column 0 occurs frequently if we ran out
of location_t with columns info.  But then we do:
  bp_pack_value (bp, ob->current_file != xloc.file, 1);
  bp_pack_value (bp, ob->current_line != xloc.line, 1);
  bp_pack_value (bp, ob->current_col != xloc.column, 1);
and stream the details only if the != is true.  If that happens immediately
after clear_line_info and e.g. xloc.column is 0, we would stream 0 bit and
not stream the actual value, so on read-in it would reuse whatever
stream_col etc. were before.  Shouldn't we set some ob->current_* new bit
that would signal we are immediately past clear_line_info which would force
all these != checks to non-zero?  Either by oring something into those
tests, or perhaps:
  if (ob->current_reset)
{
  if (xloc.file == NULL)
ob->current_file = "";
  if (xloc.line == 0)
ob->current_line = 1;
  if (xloc.column == 0)
ob->current_column = 1;
  ob->current_reset = false;
}
before doing those bp_pack_value calls with a comment, effectively forcing
all 6 != comparisons to be true?

2020-09-03  Jakub Jelinek  

PR lto/94311
* gimple.h (gimple_location_ptr, gimple_phi_arg_location_ptr): New
functions.
* streamer-hooks.h (struct streamer_hooks): Add
output_location_and_block callback.  Fix up formatting for
output_location.
(stream_output_location_and_block): Define.
* lto-streamer.h (class lto_location_cache): Fix comment typo.  Add
current_block member.
(lto_location_cache::input_location_and_block): New method.
(lto_location_cache::lto_location_cache): Initialize current_block.
(lto_location_cache::cached_location): Add block member.
(struct output_block): Add current_block member.
   

[Ada] Look at fullest view when checking for static types in unnesting

2020-09-03 Thread Arnaud Charlet
When seeing if any bound involved in a type is an uplevel reference,
we must look at the fullest view of a type, since that's what the
backends will do.  Similarly for private types. We introduce
Get_Fullest_View for that purpose.
   
Tested on x86_64-pc-linux-gnu, committed on master

* sem_util.ads, sem_util.adb (Get_Fullest_View): New procedure.
* exp_unst.adb (Check_Static_Type): Do all processing on fullest
view of specified type.

diff --git gcc/ada/exp_unst.adb gcc/ada/exp_unst.adb
index 29fe2e5..ffc30c3 100644
--- gcc/ada/exp_unst.adb
+++ gcc/ada/exp_unst.adb
@@ -471,21 +471,23 @@ package body Exp_Unst is
 Callee : Entity_Id;
 
 procedure Check_Static_Type
-  (T: Entity_Id;
+  (In_T : Entity_Id;
N: Node_Id;
DT   : in out Boolean;
Check_Designated : Boolean := False);
---  Given a type T, checks if it is a static type defined as a type
---  with no dynamic bounds in sight. If so, the only action is to
---  set Is_Static_Type True for T. If T is not a static type, then
---  all types with dynamic bounds associated with T are detected,
---  and their bounds are marked as uplevel referenced if not at the
---  library level, and DT is set True. If N is specified, it's the
---  node that will need to be replaced. If not specified, it means
---  we can't do a replacement because the bound is implicit.
-
---  If Check_Designated is True and T or its full view is an access
---  type, check whether the designated type has dynamic bounds.
+--  Given a type In_T, checks if it is a static type defined as
+--  a type with no dynamic bounds in sight. If so, the only
+--  action is to set Is_Static_Type True for In_T. If In_T is
+--  not a static type, then all types with dynamic bounds
+--  associated with In_T are detected, and their bounds are
+--  marked as uplevel referenced if not at the library level,
+--  and DT is set True. If N is specified, it's the node that
+--  will need to be replaced. If not specified, it means we
+--  can't do a replacement because the bound is implicit.
+
+--  If Check_Designated is True and In_T or its full view
+--  is an access type, check whether the designated type
+--  has dynamic bounds.
 
 procedure Note_Uplevel_Ref
   (E  : Entity_Id;
@@ -505,11 +507,13 @@ package body Exp_Unst is
 ---
 
 procedure Check_Static_Type
-  (T: Entity_Id;
+  (In_T : Entity_Id;
N: Node_Id;
DT   : in out Boolean;
Check_Designated : Boolean := False)
 is
+   T : constant Entity_Id := Get_Fullest_View (In_T);
+
procedure Note_Uplevel_Bound (N : Node_Id; Ref : Node_Id);
--  N is the bound of a dynamic type. This procedure notes that
--  this bound is uplevel referenced, it can handle references
diff --git gcc/ada/sem_util.adb gcc/ada/sem_util.adb
index 679b3be..a80cc5c 100644
--- gcc/ada/sem_util.adb
+++ gcc/ada/sem_util.adb
@@ -9958,6 +9958,79 @@ package body Sem_Util is
   end if;
end Get_Enum_Lit_From_Pos;
 
+   --
+   -- Get_Fullest_View --
+   --
+
+   function Get_Fullest_View
+ (E : Entity_Id; Include_PAT : Boolean := True) return Entity_Id is
+   begin
+  --  Strictly speaking, the recursion below isn't necessary, but
+  --  it's both simplest and safest.
+
+  case Ekind (E) is
+ when Incomplete_Kind =>
+if From_Limited_With (E) then
+   return Get_Fullest_View (Non_Limited_View (E), Include_PAT);
+elsif Present (Full_View (E)) then
+   return Get_Fullest_View (Full_View (E), Include_PAT);
+elsif Ekind (E) = E_Incomplete_Subtype then
+   return Get_Fullest_View (Etype (E));
+end if;
+
+ when Private_Kind =>
+if Present (Underlying_Full_View (E)) then
+   return
+ Get_Fullest_View (Underlying_Full_View (E), Include_PAT);
+elsif Present (Full_View (E)) then
+   return Get_Fullest_View (Full_View (E), Include_PAT);
+elsif Etype (E) /= E then
+   return Get_Fullest_View (Etype (E), Include_PAT);
+end if;
+
+ when Array_Kind =>
+if Include_PAT and then Present (Packed_Array_Impl_Type (E)) then
+   return Get_Fullest_View (Packed_Array_Impl_Type (E));
+end if;
+
+ when E_Record_Subty

[PATCH] c++: Fix another PCH hash_map issue [PR96901]

2020-09-03 Thread Jakub Jelinek via Gcc-patches
Hi!

The recent libstdc++ changes caused lots of libstdc++-v3 test FAILs
on i686-linux, all of them in the same spot during constexpr evaluation
of a recursive _S_gcd call.
The problem is yet another hash_map that used the default hashing of
tree keys through pointer hashing and that is preserved across PCH write/read.
During PCH handling, the addresses of GC objects are changed, which means
that the hash values of the keys in such hash tables change without those
hash tables being rehashed.  Which in the fundef_copies_table case usually
means we just don't find a copy of a FUNCTION_DECL body for recursive uses
and start from scratch.  But when the hash table keeps growing, the "dead"
elements in the hash table can sometimes reappear and break things.
In particular, what I saw under the debugger is that when the fundef_copies_table
hash map was used on the outer _S_gcd call, it didn't find an entry for
it, so it returned a slot with *slot == NULL, which is treated as meaning the
function itself is used directly (i.e. no recursion).  But the addition of
that hash table slot caused the recursive _S_gcd call to actually find
something in the hash table, unfortunately not the new *slot == NULL spot,
but a different one from the pre-PCH streaming which contained the returned
toplevel (non-recursive) call entry for it.  That means that for the
recursive _S_gcd call we actually used the same trees as for the outer ones
rather than a copy of those, which breaks constexpr evaluation.

Fixed thusly, bootstrapped/regtested on x86_64-linux and i686-linux,
ok for trunk (and, given that it is a latent issue, eventually for release
branches too)?

2020-09-03  Jakub Jelinek  

PR c++/96901
* tree.h (struct decl_tree_traits): New type.
(decl_tree_map): New typedef.

* constexpr.c (fundef_copies_table): Change type from
hash_map * to decl_tree_map *.

--- gcc/tree.h.jj   2020-08-31 22:50:28.545593391 +0200
+++ gcc/tree.h  2020-09-02 17:44:45.478358927 +0200
@@ -5453,6 +5453,11 @@ struct type_tree_cache_traits
   : simple_cache_map_traits { };
 typedef hash_map type_tree_cache_map;
 
+/* Similarly to decl_tree_cache_map, but without caching.  */
+struct decl_tree_traits
+  : simple_hashmap_traits { };
+typedef hash_map decl_tree_map;
+
 /* Initialize the abstract argument list iterator object ITER with the
arguments from CALL_EXPR node EXP.  */
 static inline void
--- gcc/cp/constexpr.c.jj   2020-09-01 22:47:49.946383670 +0200
+++ gcc/cp/constexpr.c  2020-09-02 17:46:35.267758742 +0200
@@ -1203,7 +1203,7 @@ maybe_initialize_constexpr_call_table (v
 
This is not GC-deletable to avoid GC affecting UID generation.  */
 
-static GTY(()) hash_map *fundef_copies_table;
+static GTY(()) decl_tree_map *fundef_copies_table;
 
 /* Reuse a copy or create a new unshared copy of the function FUN.
Return this copy.  We use a TREE_LIST whose PURPOSE is body, VALUE

Jakub



[PATCH] arm: Improve immediate generation for thumb-1 with -mpurecode [PR96769]

2020-09-03 Thread Christophe Lyon via Gcc-patches
This patch moves the move-immediate splitter after the regular ones so
that it has lower precedence, and updates its constraints.

For
int f3 (void) { return 0x1100; }
int f3_2 (void) { return 0x12345678; }

we now generate:
* with -O2 -mcpu=cortex-m0 -mpure-code:
f3:
        movs    r0, #136
        lsls    r0, r0, #21
        bx      lr
f3_2:
        movs    r0, #18
        lsls    r0, r0, #8
        adds    r0, r0, #52
        lsls    r0, r0, #8
        adds    r0, r0, #86
        lsls    r0, r0, #8
        adds    r0, r0, #121
        bx      lr
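
(For reference, the cortex-m0 sequence for f3_2 builds the constant one byte
at a time because v6-M has no movw/movt: 18 = 0x12, 52 = 0x34, 86 = 0x56,
121 = 0x78, so the result is (((((0x12 << 8) + 0x34) << 8) + 0x56) << 8) + 0x78
= 0x12345678.)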

* with -O2 -mcpu=cortex-m23 -mpure-code:
f3:
        movs    r0, #136
        lsls    r0, r0, #21
        bx      lr
f3_2:
        movw    r0, #22136
        movt    r0, 4660
        bx      lr

2020-08-28  Christophe Lyon  

PR target/96769
gcc/
* config/arm/thumb1.md: Move movsi splitter for
arm_disable_literal_pool after the other movsi splitters.

gcc/testsuite/
* gcc.target/arm/pure-code/pr96769.c: New test.
---
 gcc/config/arm/thumb1.md | 30 
 gcc/testsuite/gcc.target/arm/pure-code/pr96769.c | 20 
 2 files changed, 35 insertions(+), 15 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/arm/pure-code/pr96769.c

diff --git a/gcc/config/arm/thumb1.md b/gcc/config/arm/thumb1.md
index 602039e..4a59d87 100644
--- a/gcc/config/arm/thumb1.md
+++ b/gcc/config/arm/thumb1.md
@@ -64,21 +64,6 @@ (define_insn "thumb1_movsi_symbol_ref"
(set_attr "conds" "clob")]
 )
 
-(define_split
-  [(set (match_operand:SI 0 "register_operand" "")
-   (match_operand:SI 1 "immediate_operand" ""))]
-  "TARGET_THUMB1
-   && arm_disable_literal_pool
-   && GET_CODE (operands[1]) == CONST_INT
-   && !TARGET_HAVE_MOVT
-   && !satisfies_constraint_I (operands[1])"
-  [(clobber (const_int 0))]
-  "
-thumb1_gen_const_int (operands[0], INTVAL (operands[1]));
-DONE;
-  "
-)
-
 (define_insn "*thumb1_adddi3"
   [(set (match_operand:DI  0 "register_operand" "=l")
(plus:DI (match_operand:DI 1 "register_operand" "%0")
@@ -832,6 +817,21 @@ (define_split
   }"
 )
 
+(define_split
+  [(set (match_operand:SI 0 "register_operand" "")
+   (match_operand:SI 1 "const_int_operand" ""))]
+  "TARGET_THUMB1
+   && arm_disable_literal_pool
+   && GET_CODE (operands[1]) == CONST_INT
+   && !TARGET_HAVE_MOVT
+   && !satisfies_constraint_K (operands[1])"
+  [(clobber (const_int 0))]
+  "
+thumb1_gen_const_int (operands[0], INTVAL (operands[1]));
+DONE;
+  "
+)
+
 (define_insn "*thumb1_movhi_insn"
   [(set (match_operand:HI 0 "nonimmediate_operand" "=l,l,m,l*r,*h,l,r")
(match_operand:HI 1 "general_operand"   "l,m,l,k*h,*r,I,n"))]
diff --git a/gcc/testsuite/gcc.target/arm/pure-code/pr96769.c b/gcc/testsuite/gcc.target/arm/pure-code/pr96769.c
new file mode 100644
index 000..c90981f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/pure-code/pr96769.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-mpure-code" } */
+
+int f3 (void) { return 0x1100; }
+int f3_2 (void) { return 0x12345678; }
+int f3_3 (void) { return -1; }
+int f3_4 (void) { return 511; }
+
+/* For cortex-m0 (thumb-1/v6m), we generate 1 lsls in f3 3 lsls in f3_2 and 1 in f3_4; 1 rsbs in f3_3.  */
+/* { dg-final { scan-assembler-times "lsls" 5 { target { { ! arm_thumb1_movt_ok } && { ! arm_thumb2_ok } } } } } */
+/* { dg-final { scan-assembler-times "rsbs" 1 { target { { ! arm_thumb1_movt_ok } && { ! arm_thumb2_ok } } } } } */
+
+/* For cortex-m23 (thumb-1/v8m.base), we generate 1 lsls in f3, and none in f3_2 nor f3_4; 1 rsbs in f3_3.  */
+/* { dg-final { scan-assembler-times "lsls" 1 { target { arm_thumb1_movt_ok && { ! arm_thumb2_ok } } } } } */
+/* { dg-final { scan-assembler-times "rsbs" 1 { target { arm_thumb1_movt_ok && { ! arm_thumb2_ok } } } } } */
+
+/* For cortex-m3 (thumb-2/v7m), we generate no lsls and no rsbs.  */
+/* { dg-final { scan-assembler-times "lsls" 0 { target { arm_thumb2_ok } } } } */
+/* { dg-final { scan-assembler-times "rsbs" 0 { target { arm_thumb2_ok } } } } */
+
-- 
2.7.4



Re: [PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-03 Thread luoxhu via Gcc-patches



On 2020/9/2 17:30, Richard Biener wrote:
>> so maybe bypassing convert_vector_to_array_for_subscript for special
>> circumstances like "i = v[n%4]" or "v[n&3]=i" to generate a vec_extract or
>> vec_insert builtin call is a relatively simpler method?
> I think you have it backward.  You need to work with what
> convert_vector_to_array_for_subscript
> gives and deal with it during RTL expansion / optimization to generate
> more optimal
> code for power.  The goal is to have as little target specific
> builtins during the GIMPLE
> optimization phase (because we cannot work out its semantics in optimizers).

OK, got it, will add optabs vec_insert and expand 
"VIEW_CONVERT_EXPR(v1)[_1] = i_6(D);"
expressions to rs6000_expand_vector_insert instead of builtin call.  
vec_extract already has optabs and "i = v[n%4]" should be in another patch
after this.
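
For reference, the kind of source that produces such a GIMPLE statement (a
minimal sketch using the generic vector extension, not the testcase from the
PR):

typedef int v4si __attribute__ ((vector_size (16)));

void
set_element (v4si *v, int n, int i)
{
  (*v)[n % 4] = i;   /* variable-index insert; the subscript goes through
			convert_vector_to_array_for_subscript  */
}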


Thanks,
Xionghu



[PATCH 2/2 V3] Simplify plusminus-mult-with-convert expr in forwprop (PR 94234)

2020-09-03 Thread Feng Xue OS via Gcc-patches
This patch handles simplification of a plusminus-mult-with-convert expression
of the form ((T) X) +- ((T) Y), in which at least one of (X, Y) is the result
of a multiplication.  This is done in the forwprop pass.  We try to transform
it to (T) (X +- Y), and resort to the gimple-matcher to fold (X +- Y) instead
of hand-coded pattern recognition.

Regards,
Feng
---
2020-09-03  Feng Xue  

gcc/
PR tree-optimization/94234
* tree-ssa-forwprop.c (simplify_plusminus_mult_with_convert): New
function.
(fwprop_ssa_val): Move it before its new caller.
(pass_forwprop::execute): Add call to
simplify_plusminus_mult_with_convert.

gcc/testsuite/
PR tree-optimization/94234
* gcc.dg/pr94234-3.c: New test.

Re: [PATCH 2/2 V3] Simplify plusminus-mult-with-convert expr in forwprop (PR 94234)

2020-09-03 Thread Feng Xue OS via Gcc-patches
Attach patch file.

Feng

From: Gcc-patches  on behalf of Feng Xue OS 
via Gcc-patches 
Sent: Thursday, September 3, 2020 5:27 PM
To: Richard Biener; gcc-patches@gcc.gnu.org
Subject: [PATCH 2/2 V3] Simplify plusminus-mult-with-convert expr in forwprop 
(PR 94234)

This patch handles simplification of a plusminus-mult-with-convert expression
of the form ((T) X) +- ((T) Y), in which at least one of (X, Y) is the result
of a multiplication.  This is done in the forwprop pass.  We try to transform
it to (T) (X +- Y), and resort to the gimple-matcher to fold (X +- Y) instead
of hand-coded pattern recognition.

Regards,
Feng
---
2020-09-03  Feng Xue  

gcc/
PR tree-optimization/94234
* tree-ssa-forwprop.c (simplify_plusminus_mult_with_convert): New
function.
(fwprop_ssa_val): Move it before its new caller.
(pass_forwprop::execute): Add call to
simplify_plusminus_mult_with_convert.

gcc/testsuite/
PR tree-optimization/94234
* gcc.dg/pr94234-3.c: New test.
From 98c4b97989207dcef5742e9cb451799feafd125e Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Mon, 17 Aug 2020 23:00:35 +0800
Subject: [PATCH] tree-optimization/94234 - simplify
 plusminus-mult-with-convert in forwprop

For an expression of the form ((T) X) +- ((T) Y), where at least one of (X, Y)
is the result of a multiplication, try to transform it to (T) (X +- Y), and apply
simplification on (X +- Y) if possible.  In this way, we can avoid creating an
almost duplicated rule to handle the plusminus-mult-with-convert variant.

2020-09-03  Feng Xue  

gcc/
	PR tree-optimization/94234
	* tree-ssa-forwprop.c (simplify_plusminus_mult_with_convert): New
	function.
	(fwprop_ssa_val): Move it before its new caller.
	(pass_forwprop::execute): Add call to
	simplify_plusminus_mult_with_convert.

gcc/testsuite/
	PR tree-optimization/94234
 	* gcc.dg/pr94234-3.c: New test.
---
 gcc/testsuite/gcc.dg/pr94234-3.c |  42 
 gcc/tree-ssa-forwprop.c  | 168 +++
 2 files changed, 191 insertions(+), 19 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr94234-3.c

diff --git a/gcc/testsuite/gcc.dg/pr94234-3.c b/gcc/testsuite/gcc.dg/pr94234-3.c
new file mode 100644
index 000..9bb9b46bd96
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr94234-3.c
@@ -0,0 +1,42 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-forwprop1" } */
+
+typedef __SIZE_TYPE__ size_t;
+typedef __PTRDIFF_TYPE__ ptrdiff_t;
+
+ptrdiff_t foo1 (char *a, size_t n)
+{
+  char *b1 = a + 8 * n;
+  char *b2 = a + 8 * (n - 1);
+
+  return b1 - b2;
+}
+
+int use_ptr (char *a, char *b);
+
+ptrdiff_t foo2 (char *a, size_t n)
+{
+  char *b1 = a + 8 * (n - 1);
+  char *b2 = a + 8 * n;
+
+  use_ptr (b1, b2);
+
+  return b1 - b2;
+}
+
+int use_int (int i);
+
+unsigned goo (unsigned m_param, unsigned n_param)
+{
+  unsigned b1 = m_param * (n_param + 2);
+  unsigned b2 = m_param * (n_param + 1);
+  int r = (int)(b1) - (int)(b2);
+
+  use_int (r);
+
+  return r;
+}
+
+/* { dg-final { scan-tree-dump-times "return 8;" 1 "forwprop1" } } */
+/* { dg-final { scan-tree-dump-times "return -8;" 1 "forwprop1" } } */
+/* { dg-final { scan-tree-dump-times "return m_param" 1 "forwprop1" } } */
diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c
index e2d008dfb92..7b9d46ec919 100644
--- a/gcc/tree-ssa-forwprop.c
+++ b/gcc/tree-ssa-forwprop.c
@@ -338,6 +338,25 @@ remove_prop_source_from_use (tree name)
   return cfg_changed;
 }
 
+/* Primitive "lattice" function for gimple_simplify.  */
+
+static tree
+fwprop_ssa_val (tree name)
+{
+  /* First valueize NAME.  */
+  if (TREE_CODE (name) == SSA_NAME
+  && SSA_NAME_VERSION (name) < lattice.length ())
+{
+  tree val = lattice[SSA_NAME_VERSION (name)];
+  if (val)
+	name = val;
+}
+  /* We continue matching along SSA use-def edges for SSA names
+ that are not single-use.  Currently there are no patterns
+ that would cause any issues with that.  */
+  return name;
+}
+
 /* Return the rhs of a gassign *STMT in a form of a single tree,
converted to type TYPE.
 
@@ -1821,6 +1840,133 @@ simplify_rotate (gimple_stmt_iterator *gsi)
   return true;
 }
 
+/* Given ((T) X) +- ((T) Y), and at least one of (X, Y) is result of
+   multiplication, if the expr can be transformed to (T) (X +- Y) in terms of
+   two's complement computation, apply simplification on (X +- Y) if it is
+   possible.  As a prerequisite, outer result type (T) has precision not more
+   than that of inner operand type.  */
+
+static bool
+simplify_plusminus_mult_with_convert (gimple_stmt_iterator *gsi)
+{
+  gimple *stmt = gsi_stmt (*gsi);
+  tree lhs = gimple_assign_lhs (stmt);
+  tree rtype = TREE_TYPE (lhs);
+  tree ctype = NULL_TREE;
+  enum tree_code code = gimple_assign_rhs_code (stmt);
+
+  if (code != PLUS_EXPR && code != MINUS_EXPR)
+return false;
+
+  /* Skip arithemtic operations that need to preserve overflow for trapping
+ or sanitization.  */
+  if (!INTEGRAL_TYPE

[PATCH] Optimize comparison between result of us_minus and 0.

2020-09-03 Thread Hongtao Liu via Gcc-patches
Hi:
  Add define_peephole2 to perform optimizations like the ones below:

+/* Optimize for TARGET_AVX512F
+  vpsubusw op1, op2, dst1;
+  vxorps xmm, xmm, dst2; >   vpcmpleuw op1, op2, dst3
+  vpcmpeqw dst1, dst2, dst3  */

and

+/* Optimize for target above TARGET_SSE4_1
+  vpsubusw op1, op2, dst1;  vpminuw op1, op2, dst1
+  vpxor xmm, xmm, dst2; >   vpcmpeqw op1, dst1, dst3
+  vpcmpeqw dst1, dst2, dst3  */
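
As a source-level illustration (my assumption about the kind of code behind
the PR, not taken from the testcases), the pattern is an unsigned saturating
subtraction compared against zero, which is an elementwise unsigned "a <= b":

#include <immintrin.h>

/* Before: psubusw + pxor + pcmpeqw, as in the first sequence above.  */
__m128i
le_u16_subs (__m128i a, __m128i b)
{
  return _mm_cmpeq_epi16 (_mm_subs_epu16 (a, b), _mm_setzero_si128 ());
}

/* After (SSE4.1): pminuw + pcmpeqw, matching the second sequence above.  */
__m128i
le_u16_min (__m128i a, __m128i b)
{
  return _mm_cmpeq_epi16 (_mm_min_epu16 (a, b), a);
}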

Bootstrap is ok, regression test is ok for i386/x86-64 backend.
Ok for trunk?

gcc/ChangeLog:
PR target/96906
* config/i386/sse.md (VI12_128_256): New mode iterator.
(define_peephole2): Optimize comparison between result of
us_minus and 0, it could be optimized to "vpcmplequ" for
AVX512 or "pminu + cmpeq" for target above TARGET_SSE4_1.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx2-pr96906-1.c: New test.
* gcc.target/i386/avx512f-pr96906-1.c: New test.
* gcc.target/i386/sse2-pr96906.c: New test.
* gcc.target/i386/sse4_1-pr96906-1.c: New test.

--
BR,
Hongtao
From dbfbd5350c9d12a0e5ca643cf9666d041d7d4744 Mon Sep 17 00:00:00 2001
From: liuhongt 
Date: Thu, 3 Sep 2020 16:18:20 +0800
Subject: [PATCH] Optimize __builtin_ia32_psubusw128 compared to 0 to
 __builtin_ia32_pminuw128 compared to operand

gcc/ChangeLog:
	PR target/96906
	* config/i386/sse.md (VI12_128_256): New mode iterator.
	(define_peephole2): Optimize comparison between result of
	us_minus and 0, it could be optimized to "vpcmplequ" for
	AVX512 or "pminu + cmpeq" for target above TARGET_SSE4_1.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/avx2-pr96906-1.c: New test.
	* gcc.target/i386/avx512f-pr96906-1.c: New test.
	* gcc.target/i386/sse2-pr96906.c: New test.
	* gcc.target/i386/sse4_1-pr96906-1.c: New test.
---
 gcc/config/i386/sse.md| 63 +++
 .../gcc.target/i386/avx2-pr96906-1.c  | 17 +
 .../gcc.target/i386/avx512f-pr96906-1.c   | 40 
 gcc/testsuite/gcc.target/i386/sse2-pr96906.c  | 11 
 .../gcc.target/i386/sse4_1-pr96906-1.c| 11 
 5 files changed, 142 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/avx2-pr96906-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512f-pr96906-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/sse2-pr96906.c
 create mode 100644 gcc/testsuite/gcc.target/i386/sse4_1-pr96906-1.c

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 8250325e1a3..60a571494d5 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -636,6 +636,7 @@ (define_mode_iterator VI124_128 [V16QI V8HI V4SI])
 (define_mode_iterator VI24_128 [V8HI V4SI])
 (define_mode_iterator VI248_128 [V8HI V4SI V2DI])
 (define_mode_iterator VI48_128 [V4SI V2DI])
+(define_mode_iterator VI12_128_256 [V16QI V32QI V8HI V16HI])
 
 ;; Various 256bit and 512 vector integer mode combinations
 (define_mode_iterator VI124_256 [V32QI V16HI V8SI])
@@ -12943,6 +12944,68 @@ (define_insn "sse2_gt3"
(set_attr "prefix" "orig,vex")
(set_attr "mode" "TI")])
 
+/* Optimize for TARGET_AVX512F
+  vpsubusw op1, op2, dst1;
+  vxorps xmm, xmm, dst2; >   vpcmpleuw op1, op2, dst3
+  vpcmpeqw dst1, dst2, dst3  */
+(define_peephole2
+  [(set (match_operand:VI12_AVX512VL 0 "register_operand")
+	(us_minus:VI12_AVX512VL
+	  (match_operand:VI12_AVX512VL 1 "register_operand")
+	  (match_operand:VI12_AVX512VL 2 "vector_operand")))
+   (set (match_operand:VI12_AVX512VL 3 "register_operand")
+	(match_operand:VI12_AVX512VL 4 "const0_operand"))
+   (set (match_operand: 5 "mask_reg_operand")
+	  (unspec:
+	[(match_operand:VI12_AVX512VL 6 "register_operand")
+	 (match_operand:VI12_AVX512VL 7 "register_operand")
+	 (const_int 0)]
+	UNSPEC_PCMP))]
+  "((rtx_equal_p (operands[0], operands[6])
+&& rtx_equal_p (operands[3], operands[7]))
+   || (rtx_equal_p (operands[0], operands[7])
+  && rtx_equal_p (operands[3], operands[6])))
+  && peep2_reg_dead_p (3, operands[0])
+  && peep2_reg_dead_p (3, operands[3])"
+  [(set (match_dup 5)
+	(unspec:
+	  [(match_dup 1)
+	   (match_dup 2)
+	   (const_int 2)] UNSPEC_UNSIGNED_PCMP))])
+
+/* Optimize for target above TARGET_SSE4_1
+  vpsubusw op1, op2, dst1;  vpminuw op1, op2, dst1
+  vpxor xmm, xmm, dst2; >   vpcmpeqw op1, dst1, dst3
+  vpcmpeqw dst1, dst2, dst3  */
+(define_peephole2
+  [(set (match_operand:VI12_128_256 0 "register_operand")
+	(us_minus:VI12_128_256
+	  (match_operand:VI12_128_256 1 "register_operand")
+	  (match_operand:VI12_128_256 2 "vector_operand")))
+   (set (match_operand:VI12_128_256 3 "register_operand")
+	(match_operand:VI12_128_256 4 "const0_operand"))
+   (set (match_operand:VI12_128_256 5 "register_operand")
+	(eq:VI12_128_256
+	  (match_operand:VI12_128_256 6 "register_operand")
+	   (match_operand:VI12_128_256 7 "register_operand")))]
+  "(TARGET_SSE4_1 || mode == V16QImode)
+  && ((rtx_equal_p (operands[0], operands[6])
+  && rtx_equal_p (operands[3], operands[7]))
+ || 

Re: [RFC] enable flags-unchanging asms, add_overflow/expand/combine woes

2020-09-03 Thread Alexandre Oliva
On Sep  2, 2020, Segher Boessenkool  wrote:

> Hi!
> On Tue, Sep 01, 2020 at 07:22:57PM -0300, Alexandre Oliva wrote:
>> It's ugly, I know.

> Yeah, it's bloody disgusting :-)

:-)

> (And aarch, but not the other targets with default clobbers).

And arm.  I didn't look very hard for others yet; I wasn't sure it would
be worth pursuing any further.

>> even special-casing empty asm patterns (at a slight risk of breaking
>> code that expects implicit clobbers even for empty asm patterns, but
>> empty asm patterns won't clobber flags, so how could it break
>> anything?).

> People write empty asm statements not because they would like no insns
> emitted from it, but *because* they want the other effects an asm has
> (for example, an empty asm usually has no outputs, so it is volatile,
> and that makes sure it is executed in the real machine exactly as often
> as in the abstract machine).  So your expectation might be wrong,
> someone might want an empty asm to clobber cc on x86 (like any asm is
> documented as doing).

I just can't imagine a case in which flags set by an earlier
instruction, and that could be usable by a subsequent instruction, could
possibly be clobbered by an *empty* sequence of insns.  Maybe I just
lack imagination about all the sorts of weird things that people use
such asm statements for...  Got any example of something that could be
broken by such a change to educate me?  Not that I need it, I'm just
curious.

> But how about a "none" clobber?

I like that!

>> I take this might be useful for do-nothing asm
>> statements, often used to stop certain optimizations, e.g.:
>> 
>> __typeof (*p) c = __builtin_add_overflow (*p, 1, p);
>> asm ("" : "+m" (*p)); // Make sure we write to memory.
>> *p += c; // This should compile into an add with carry.

> Wow, nasty.  That asm cannot be optimised away even if the rest is
> (unless GCC can somehow figure out nothing ever uses *p).  Is there no
> better way to do this?

I don't know.  There's a piece of hand-crafted x86 assembly that I'm
trying to find some way to replicate in compiler-optimized code.  I
don't know whether two RMWs could possibly be faster, but there are two
concerns driving me to try to use the same insn sequence:

- compactness; this is for per-block instrumentation, and if we use 4
  larger insns instead of 2 smaller ones, there might be even more
  significant impact on performance because of code cache issues

- threads: though the instrumentation code is definitely not
  thread-safe, and failing to expose the intermediate state should not
  have a significant impact, I wanted to at least be able to compare the
  use of code as in existing instrumentation code with
  compiler-generated instrumentation, and measure any performance
  differences and actual impact on multi-threaded programs.
  
>> Without the asm, we issue load;add;adc;store, which is not the ideal
>> sequence with add and adc to the same memory address (or two different
>> addresses, if the last statement uses say *q instead of *p).

> Is doing two RMWs on memory faster?  Huh.

I haven't really measured it (yet), I just assumed whoever put the asm
code there like that, who seemed to be both knowledgeable and concerned
about the performance impact on a very hot piece of instrumentation
code, knew what they were doing, but when I set out to check it myself,
I've hit this wall.  At first I though GCC was picking the sequence it
did because it was faster, but then, while I investigated, I came to the
conclusion that it picked it because it didn't stand a chance of doing
otherwise, not even if I pushed it really hard.


>> Alas, getting the first add to go straight to memory is more
>> complicated.  Even with the asm that forces the output to memory, the
>> output flag makes it harder to get it optimized to an add-to-memory
>> form.  When the output flag is unused, we optimize it enough in gimple
>> that TER does its job and we issue a single add, but that's not possible
>> when the two outputs of ADD_OVERFLOW are used: the flag setting gets
>> optimized away, but only after stopping combine from turning the
>> load/add/store into an add-to-memory.
>> 
>> If we retried the 3-insn substitution after substituting the flag store
>> into the add for the adc,

> combine should retry every combination if any of the input insns to it
> have changed (put another way, if any insn is changed all combinations
> with it are tried anew).

Alas, they don't change, it's an intervening flag-store that does.  When
we try the 3 insns that make up RMW, while the flag store is still there
before the W, it triggers one of the cant_combine tests.  After it's
gone, we don't reconsider.  Does that count as a bug to be filed?

> But.  Dependencies through memory are never used for combine (it uses
> dependencies through registers only), maybe that is what you are seeing?

No, it's just that the use of the flag between M and W prevents
substituting the M into the W, because then the use of t

Re: [PATCH] lto: Cache location_ts including BLOCKs in GIMPLE streaming [PR94311]

2020-09-03 Thread Richard Biener
On Thu, 3 Sep 2020, Jakub Jelinek wrote:

> Hi!
> 
> As mentioned in the PR, when compiling valgrind, even on a fairly small
> testcase where in one larger function the location keeps oscillating
> between a small line number and an 8000-ish line number in the same file,
> we very quickly run out of all possible location_t numbers and because of
> that emit nonsensical line numbers in .debug_line.
> There are ways to decrease the speed of depleting location_t numbers
> in libcpp, but the main reason for this is that we use
> stream_input_location_now for streaming in location_t for gimple_location
> and phi arg locations.  libcpp strongly prefers that the locations
> it is given are sorted by the different files and by line numbers in
> ascending order, otherwise it depletes quickly no matter what and is much
> more costly (many extra file changes etc.).
> The reason for not caching those was the BLOCKs that were streamed
> immediately after the location and encoded into the locations (and for PHIs
> we failed to stream the BLOCKs altogether).
> This patch enhances the location cache to also handle BLOCKs (not for
> everything, only for the spots where we care about the BLOCKs) and also optimizes
> the size of the LTO stream by emitting a single bit into a pack to say whether the
> BLOCK changed from the last one, and only streaming the BLOCK tree if it
> changed.
> 
> Bootstrapped/regtested on x86_64-linux with
> ../configure --enable-languages=default,ada,obj-c++,lto,go,brig,d 
> --with-build-config=bootstrap-lto --enable-checking=yes,rtl,extra && make 
> -j16 bootstrap > LOG 2>&1 && GXX_TESTSUITE_STDS=98,11,14,17,2a make -j32 -k 
> check > LOGC 2>&1; ../contrib/test_summary > LOGT 2>&1
> as well as normally bootstrapped/regtested on x86_64-linux and i686-linux,
> ok for trunk?

LGTM.

Can we now remove stream_input_location_now?

> BTW, I wonder about one preexisting issue (but that can be done
> incrementally).  On the streamer out side, we call clear_line_info
> in multiple spots which resets the current_* values to something, but on the
> reader side, we don't have corresponding resets in the same location, just 
> have
> the stream_* static variables that keep the current values through the
> entire stream in (so across all the clear_line_info spots in a single LTO
> object but also across jumping from one LTO object to another one).
> Now, in an earlier version of my patch it actually broke LTO bootstrap
> (and a lot of LTO testcases), so for the BLOCK case I've solved it by
> clear_line_info setting current_block to something that should never appear,
> which means that in the LTO stream after the clear_line_info spots including
> the start of the LTO stream we force the block change bit to be set and thus
> BLOCK to be streamed and therefore stream_block from earlier to be
> ignored.  But for the rest I think that is not the case, so I wonder if we
> don't sometimes end up with wrong line/column info because of that, or
> please tell me what prevents that.
> clear_line_info does:
>   ob->current_file = NULL;
>   ob->current_line = 0;
>   ob->current_col = 0;
>   ob->current_sysp = false;
> while I think NULL current_file is something that should likely be different
> from expanded_location (...).file (UNKNOWN_LOCATION/BUILTINS_LOCATION are
> handled separately and not go through the caching), I think line number 0
> can sometimes occur and especially column 0 occurs frequently if we ran out
> of location_t with columns info.  But then we do:
>   bp_pack_value (bp, ob->current_file != xloc.file, 1);
>   bp_pack_value (bp, ob->current_line != xloc.line, 1);
>   bp_pack_value (bp, ob->current_col != xloc.column, 1);
> and stream the details only if the != is true.  If that happens immediately
> after clear_line_info and e.g. xloc.column is 0, we would stream 0 bit and
> not stream the actual value, so on read-in it would reuse whatever
> stream_col etc. were before.  Shouldn't we set some ob->current_* new bit
> that would signal we are immediately past clear_line_info which would force
> all these != checks to non-zero?  Either by oring something into those
> tests, or perhaps:
>   if (ob->current_reset)
> {
>   if (xloc.file == NULL)
>   ob->current_file = "";
>   if (xloc.line == 0)
>   ob->current_line = 1;
>   if (xloc.column == 0)
>   ob->current_column = 1;
>   ob->current_reset = false;
> }
> before doing those bp_pack_value calls with a comment, effectively forcing
> all 6 != comparisons to be true?

Yeah, guess that would be cleanest.  How does UNKNOWN_LOCATION
expand btw?  Using that as "reset" value also might work?

Thanks,
Richard.

> 2020-09-03  Jakub Jelinek  
> 
>   PR lto/94311
>   * gimple.h (gimple_location_ptr, gimple_phi_arg_location_ptr): New
>   functions.
>   * streamer-hooks.h (struct streamer_hooks): Add
>   output_location_and_block callback.  Fix up formatting for
>   output_location.
>   (stream_output_location_and

Re: [PATCH] lto: Cache location_ts including BLOCKs in GIMPLE streaming [PR94311]

2020-09-03 Thread Jakub Jelinek via Gcc-patches
On Thu, Sep 03, 2020 at 12:07:36PM +0200, Richard Biener wrote:
> LGTM.

Thanks.

> Can we now remove stream_input_location_now?

There are a few remaining users, one is:
  case LTO_ert_must_not_throw:
{
  r->type = ERT_MUST_NOT_THROW;
  r->u.must_not_throw.failure_decl = stream_read_tree (ib, data_in);
  bitpack_d bp = streamer_read_bitpack (ib);
  r->u.must_not_throw.failure_loc
   = stream_input_location_now (&bp, data_in);
}
and the other two:
  /* Input the function start and end loci.  */
  fn->function_start_locus = stream_input_location_now (&bp, data_in);
  fn->function_end_locus = stream_input_location_now (&bp, data_in);
I know nothing about those, so I have kept them as is.  I don't know whether
failure_loc can include blocks, what the reason is for using
*_now (can r be reallocated?), and similarly for the function start/end.
Will defer these to Honza or anybody else who knows LTO better.

Ok, looking some more, seems r isn't really reallocated and failure_loc
shouldn't really include blocks, judging from
case ERT_MUST_NOT_THROW:
  new_r->u.must_not_throw.failure_loc =
LOCATION_LOCUS (old_r->u.must_not_throw.failure_loc);
and
  this_region->u.must_not_throw.failure_loc
= LOCATION_LOCUS (gimple_location (tp));
as the only setters of failure_loc outside of lto-streamer-in.c.
Dunno if it would be sufficient to just use normal stream_input_location
and let the apply cache after all EH regions and later all bbs are input
apply it.

> > all these != checks to non-zero?  Either by oring something into those
> > tests, or perhaps:
> >   if (ob->current_reset)
> > {
> >   if (xloc.file == NULL)
> > ob->current_file = "";
> >   if (xloc.line == 0)
> > ob->current_line = 1;
> >   if (xloc.column == 0)
> > ob->current_column = 1;
> >   ob->current_reset = false;
> > }
> > before doing those bp_pack_value calls with a comment, effectively forcing
> > all 6 != comparisons to be true?
> 
> Yeah, guess that would be cleanest.  How does UNKNOWN_LOCATION
> expand btw?  Using that as "reset" value also might work?

On the lto-streamer-out.c, for UNKNOWN_LOCATION/BUILTINS_LOCATION we just
do:
  bp_pack_int_in_range (bp, 0, RESERVED_LOCATION_COUNT,
loc < RESERVED_LOCATION_COUNT
? loc : RESERVED_LOCATION_COUNT);
and that is it (well, for locs with blocks I do more now, but
ob->current_{file,line,column,sysp} aren't touched).

I'll test a patch.

Jakub



Re: [PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-03 Thread Richard Biener via Gcc-patches
On Thu, Sep 3, 2020 at 11:20 AM luoxhu  wrote:
>
>
>
> On 2020/9/2 17:30, Richard Biener wrote:
> >> so maybe bypass convert_vector_to_array_for_subscript for special 
> >> circumstance
> >> like "i = v[n%4]" or "v[n&3]=i" to generate vec_extract or vec_insert 
> >> builtin
> >> call a relative simpler method?
> > I think you have it backward.  You need to work with what
> > convert_vector_to_array_for_subscript
> > gives and deal with it during RTL expansion / optimization to generate
> > more optimal
> > code for power.  The goal is to have as little target specific
> > builtins during the GIMPLE
> > optimization phase (because we cannot work out its semantics in optimizers).
>
> OK, got it, will add optabs vec_insert and expand 
> "VIEW_CONVERT_EXPR(v1)[_1] = i_6(D);"
> expressions to rs6000_expand_vector_insert instead of builtin call.
> vec_extract already has optabs and "i = v[n%4]" should be in another patch
> after this.

There is already vec_set and vec_extract - the question is whether the expander
tries those for variable index.

Richard.

>
> Thanks,
> Xionghu
>


Re: [PATCH] lto: Cache location_ts including BLOCKs in GIMPLE streaming [PR94311]

2020-09-03 Thread Richard Biener
On Thu, 3 Sep 2020, Jakub Jelinek wrote:

> On Thu, Sep 03, 2020 at 12:07:36PM +0200, Richard Biener wrote:
> > LGTM.
> 
> Thanks.
> 
> > Can we now remove stream_input_location_now?
> 
> There are a few remaining users, one is:
>   case LTO_ert_must_not_throw:
> {
>   r->type = ERT_MUST_NOT_THROW;
>   r->u.must_not_throw.failure_decl = stream_read_tree (ib, data_in);
>   bitpack_d bp = streamer_read_bitpack (ib);
>   r->u.must_not_throw.failure_loc
>= stream_input_location_now (&bp, data_in);
> }
> and the other two:
>   /* Input the function start and end loci.  */
>   fn->function_start_locus = stream_input_location_now (&bp, data_in);
>   fn->function_end_locus = stream_input_location_now (&bp, data_in);
> I know nothing about those, so have kept them as is.

I think it's safe to share locations with the rest of the IL and the EH
tree so just replacing them should be fine.

>  I don't know if
> failure_loc can or can't include blocks, what is the reason why it uses
> *_now (can r be reallocated?) and similarly about the function start/end.
> Will defer these to Honza or anybody else who knows LTO better.
> 
> Ok, looking some more, seems r isn't really reallocated and failure_loc
> shouldn't really include blocks, judging from
> case ERT_MUST_NOT_THROW:
>   new_r->u.must_not_throw.failure_loc =
> LOCATION_LOCUS (old_r->u.must_not_throw.failure_loc);
> and
>   this_region->u.must_not_throw.failure_loc
> = LOCATION_LOCUS (gimple_location (tp));
> as the only setters of failure_loc outside of lto-streamer-in.c.
> Dunno if it would be sufficient to just use normal stream_input_location
> and let the apply cache after all EH regions and later all bbs are input
> apply it.

I think so.

> > > all these != checks to non-zero?  Either by oring something into those
> > > tests, or perhaps:
> > >   if (ob->current_reset)
> > > {
> > >   if (xloc.file == NULL)
> > >   ob->current_file = "";
> > >   if (xloc.line == 0)
> > >   ob->current_line = 1;
> > >   if (xloc.column == 0)
> > >   ob->current_column = 1;
> > >   ob->current_reset = false;
> > > }
> > > before doing those bp_pack_value calls with a comment, effectively forcing
> > > all 6 != comparisons to be true?
> > 
> > Yeah, guess that would be cleanest.  How does UNKNOWN_LOCATION
> > expand btw?  Using that as "reset" value also might work?
> 
> On the lto-streamer-out.c, for UNKNOWN_LOCATION/BUILTINS_LOCATION we just
> do:
>   bp_pack_int_in_range (bp, 0, RESERVED_LOCATION_COUNT,
> loc < RESERVED_LOCATION_COUNT
> ? loc : RESERVED_LOCATION_COUNT);
> and that is it (well, for locs with blocks I do more now, but
> ob->current_{file,line,column,sysp} aren't touched).
> 
> I'll test a patch.

Thanks,
Richard.


[PATCH] Improve constant folding of vector lowering with vector bools

2020-09-03 Thread Richard Biener
This improves the situation somewhat when vector lowering tries
to access vector bools as seen in PR96814.

We still fail to fold VIEW_CONVERT { 0, -1, -1, ... } for
vector bools with integer mode.

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

2020-09-03  Richard Biener  

* tree-vect-generic.c (tree_vec_extract): Remove odd
special-casing of boolean vectors.
* fold-const.c (fold_ternary_loc): Handle boolean vector
type BIT_FIELD_REFs.
---
 gcc/fold-const.c|  4 +++-
 gcc/tree-vect-generic.c | 14 +-
 2 files changed, 4 insertions(+), 14 deletions(-)

diff --git a/gcc/fold-const.c b/gcc/fold-const.c
index 1f861630225..0cc80adf632 100644
--- a/gcc/fold-const.c
+++ b/gcc/fold-const.c
@@ -12581,7 +12581,9 @@ fold_ternary_loc (location_t loc, enum tree_code code, 
tree type,
  && tree_fits_uhwi_p (op2))
{
  tree eltype = TREE_TYPE (TREE_TYPE (arg0));
- unsigned HOST_WIDE_INT width = tree_to_uhwi (TYPE_SIZE (eltype));
+ unsigned HOST_WIDE_INT width
+   = (TREE_CODE (eltype) == BOOLEAN_TYPE
+  ? TYPE_PRECISION (eltype) : tree_to_uhwi (TYPE_SIZE (eltype)));
  unsigned HOST_WIDE_INT n = tree_to_uhwi (arg1);
  unsigned HOST_WIDE_INT idx = tree_to_uhwi (op2);
 
diff --git a/gcc/tree-vect-generic.c b/gcc/tree-vect-generic.c
index 6d5d65195ae..d7bafa77134 100644
--- a/gcc/tree-vect-generic.c
+++ b/gcc/tree-vect-generic.c
@@ -136,19 +136,7 @@ tree_vec_extract (gimple_stmt_iterator *gsi, tree type,
t = gimple_assign_rhs1 (def_stmt);
 }
   if (bitpos)
-{
-  if (TREE_CODE (type) == BOOLEAN_TYPE)
-   {
- tree itype
-   = build_nonstandard_integer_type (tree_to_uhwi (bitsize), 0);
- tree field = gimplify_build3 (gsi, BIT_FIELD_REF, itype, t,
-   bitsize, bitpos);
- return gimplify_build2 (gsi, NE_EXPR, type, field,
- build_zero_cst (itype));
-   }
-  else
-   return gimplify_build3 (gsi, BIT_FIELD_REF, type, t, bitsize, bitpos);
-}
+return gimplify_build3 (gsi, BIT_FIELD_REF, type, t, bitsize, bitpos);
   else
 return gimplify_build1 (gsi, VIEW_CONVERT_EXPR, type, t);
 }
-- 
2.26.2


Re: [PATCH] test/rs6000: Replace test target p8 and p9+

2020-09-03 Thread Segher Boessenkool
Hi!

On Tue, Sep 01, 2020 at 11:19:55AM +0800, Kewen.Lin wrote:
> This is a trivial patch to clean existing rs6000 test targets
> p8 and p9+ with existing has_arch_pwr8 and has_arch_pwr9
> target combination or only one of them.  Not sure if it's a
> good idea to tidy this, but send out for comments.

The has_arch_xxx is a lot easier to understand.  The "p9+" thing would
become especially un-nice if there was an actual processor called
"Power9+" (there were for p4, p5, p7 at least, so it is a practical
concern).

>   * gcc.target/powerpc/pr92398.p9+.c: Replace p9+ with has_arch_pwr9.
>   * gcc.target/powerpc/pr92398.p9-.c: Replace p9+ with has_arch_pwr9,
>   and replace p8 with has_arch_pwr8 && !has_arch_pwr9.
>   * lib/target-supports.exp (check_effective_target_p8): Remove.
>   (check_effective_target_p9+): Remove.

> +/* { dg-final { scan-assembler-times {\mstd\M} 2 { xfail { { {! 
> has_arch_pwr9} && has_arch_pwr8 } && be } } } } */

Not sure you need all braces here.  But this isn't pure tcl, some of it
is parsed by routines in dejagnu (which allow much less), so who knows.

Okay for trunk.  Thanks!


Segher


[committed] libstdc++: Optimise GCD algorithms

2020-09-03 Thread Jonathan Wakely via Gcc-patches
The current std::gcd and std::chrono::duration::_S_gcd algorithms are
both recursive. This is potentially expensive to evaluate in constant
expressions, because each level of recursion makes a new copy of the
function to evaluate. The maximum number of steps is bounded
(proportional to the number of decimal digits in the smaller value) and
so unlikely to exceed the limit for constexpr nesting, but the memory
usage is still suboptimal. By using an iterative algorithm we avoid
that compile-time cost. Because looping in constexpr functions is not
allowed until C++14, we need to keep the recursive implementation in
duration::_S_gcd for C++11 mode.

For std::gcd we can also optimise runtime performance by using the
binary GCD algorithm.

libstdc++-v3/ChangeLog:

* include/std/chrono (duration::_S_gcd): Use iterative algorithm
for C++14 and later.
* include/std/numeric (__detail::__gcd): Replace recursive
Euclidean algorithm with iterative version of binary GCD algorithm.
* testsuite/26_numerics/gcd/1.cc: Test additional inputs.
* testsuite/26_numerics/gcd/gcd_neg.cc: Adjust dg-error lines.
* testsuite/26_numerics/lcm/lcm_neg.cc: Likewise.
* testsuite/experimental/numeric/gcd.cc: Test additional inputs.
* testsuite/26_numerics/gcd/2.cc: New test.

Tested powerpc64le-linux. Committed to trunk.

commit 3c219134152f645103f2fcd50735b177ccd76cde
Author: Jonathan Wakely 
Date:   Thu Sep 3 12:38:50 2020

libstdc++: Optimise GCD algorithms

The current std::gcd and std::chrono::duration::_S_gcd algorithms are
both recursive. This is potentially expensive to evaluate in constant
expressions, because each level of recursion makes a new copy of the
function to evaluate. The maximum number of steps is bounded
(proportional to the number of decimal digits in the smaller value) and
so unlikely to exceed the limit for constexpr nesting, but the memory
usage is still suboptimal. By using an iterative algorithm we avoid
that compile-time cost. Because looping in constexpr functions is not
allowed until C++14, we need to keep the recursive implementation in
duration::_S_gcd for C++11 mode.

For std::gcd we can also optimise runtime performance by using the
binary GCD algorithm.

libstdc++-v3/ChangeLog:

* include/std/chrono (duration::_S_gcd): Use iterative algorithm
for C++14 and later.
* include/std/numeric (__detail::__gcd): Replace recursive
Euclidean algorithm with iterative version of binary GCD algorithm.
* testsuite/26_numerics/gcd/1.cc: Test additional inputs.
* testsuite/26_numerics/gcd/gcd_neg.cc: Adjust dg-error lines.
* testsuite/26_numerics/lcm/lcm_neg.cc: Likewise.
* testsuite/experimental/numeric/gcd.cc: Test additional inputs.
* testsuite/26_numerics/gcd/2.cc: New test.

diff --git a/libstdc++-v3/include/std/chrono b/libstdc++-v3/include/std/chrono
index 1682263fd9f..0e2efb2522b 100644
--- a/libstdc++-v3/include/std/chrono
+++ b/libstdc++-v3/include/std/chrono
@@ -428,8 +428,20 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
_S_gcd(intmax_t __m, intmax_t __n) noexcept
{
  // Duration only allows positive periods so we don't need to
- // support negative values here (unlike __static_gcd and std::gcd).
+ // handle negative values here (unlike __static_gcd and std::gcd).
+#if __cplusplus >= 201402L
+ while (__m != 0 && __n != 0)
+   {
+ intmax_t __rem = __m % __n;
+ __m = __n;
+ __n = __rem;
+   }
+ return __m + __n;
+#else
+ // C++11 doesn't allow loops in constexpr functions, but this
+ // recursive version can be more expensive to evaluate.
  return (__m == 0) ? __n : (__n == 0) ? __m : _S_gcd(__n, __m % __n);
+#endif
}
 
// _GLIBCXX_RESOLVE_LIB_DEFECTS
diff --git a/libstdc++-v3/include/std/numeric b/libstdc++-v3/include/std/numeric
index bd70a52019b..2de6aaf06ec 100644
--- a/libstdc++-v3/include/std/numeric
+++ b/libstdc++-v3/include/std/numeric
@@ -76,6 +76,7 @@
 
 #if __cplusplus >= 201402L
 #include <type_traits>
+#include <bit>
 
 namespace std _GLIBCXX_VISIBILITY(default)
 {
@@ -97,15 +98,40 @@ namespace __detail
 
   template void __absu(bool) = delete;
 
-  // GCD implementation
+  // GCD implementation, using Stein's algorithm
   template
 constexpr _Tp
 __gcd(_Tp __m, _Tp __n)
 {
   static_assert(is_unsigned<_Tp>::value, "type must be unsigned");
-  return __m == 0 ? __n
-   : __n == 0 ? __m
-   : __detail::__gcd(__n, _Tp(__m % __n));
+
+  if (__m == 0)
+   return __n;
+  if (__n == 0)
+   return __m;
+
+  const int __i = std::__countr_zero(__m);
+  __m >>= __i;
+  const int __j = std::__countr_zero(__n);
+  __n >>= __j;
+  const int __k = __i < __j ? __i : __j; // min(i, j)
+
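For reference, the new __gcd continues from here with the standard iterative
binary GCD.  The following standalone sketch is a paraphrase of Stein's
algorithm under the same assumptions (unsigned operands; it uses the public
C++20 std::countr_zero where libstdc++ uses the internal __countr_zero), not
the exact committed code:

#include <bit>
#include <utility>

template<typename T>            // T is assumed to be an unsigned integer type
constexpr T
binary_gcd (T m, T n)
{
  if (m == 0) return n;
  if (n == 0) return m;

  const int i = std::countr_zero (m);  m >>= i;   // strip factors of two
  const int j = std::countr_zero (n);  n >>= j;
  const int k = i < j ? i : j;                    // gcd keeps min(i, j) of them

  for (;;)
    {
      // Invariant: m and n are both odd here.
      if (m > n)
        std::swap (m, n);       // keep m <= n
      n -= m;                   // n becomes even (possibly zero)
      if (n == 0)
        return m << k;          // restore the common factors of two
      n >>= std::countr_zero (n);
    }
}

static_assert (binary_gcd (48u, 18u) == 6, "");
static_assert (binary_gcd (0u, 7u) == 7, "");

The point of the binary form is that the loop uses only shifts, compares and
subtractions instead of the % of the Euclidean version, which is what the
cover letter means by optimising runtime performance.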

ubsan: d-demangle.c:214 signed integer overflow

2020-09-03 Thread Alan Modra via Gcc-patches
Running the libiberty testsuite
./test-demangle < libiberty/testsuite/d-demangle-expected
libiberty/d-demangle.c:214:14: runtime error: signed integer overflow: 
922337203 * 10 cannot be represented in type 'long int'

On looking at silencing ubsan, I found a real bug in dlang_number.
For a 32-bit long, some overflows won't be detected.  For example,
21474836480.  Why?  Well 214748364 * 10 is 0x7FFFFFF8 (no overflow so
far).  Adding 8 gives 0x80000000 (which does overflow but there is no
test for that overflow in the code).  Then multiplying 0x80000000 * 10
= 0x500000000 = 0 won't be caught by the multiplication overflow test.
The same holds for a 64-bit long using similarly crafted digit
sequences.
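To make the failure mode concrete, here is a small standalone simulation of
the old check (my own sketch, using 32-bit unsigned arithmetic to model the
wrap-around of a 32-bit long; it is not code from the patch):

#include <stdint.h>
#include <stdio.h>

int
main (void)
{
  const char *p = "21474836480";   /* the crafted digit sequence */
  uint32_t val = 0;                /* stands in for a 32-bit long */

  for (; *p; p++)
    {
      val *= 10;
      /* Old test: an overflow when multiplying by ten leaves a value
	 that is not a multiple of ten.  */
      if (val % 10 != 0)
	{
	  puts ("overflow caught");
	  return 0;
	}
      val += (uint32_t) (*p - '0');
    }

  /* The old final test was "(long) val < 0"; val has wrapped all the
     way around to 0 here, so neither test fires.  */
  printf ("slipped through, val = %u\n", val);
  return 0;
}

Neither check triggers for this input even though 21474836480 does not fit
in 32 bits, which is exactly the bug described above.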

This patch replaces the mod 10 test with a simpler limit test, and
similarly the mod 26 test in dlang_decode_backref.

About the limit test:
  val * 10 + digit > ULONG_MAX is the condition for overflow
ie.
  val * 10 > ULONG_MAX - digit
or
  val > (ULONG_MAX - digit) / 10
or assuming the largest digit
  val > (ULONG_MAX - 9) / 10

I resisted the aesthetic appeal of simplifying this further to
  val > -10UL / 10
since -1UL for ULONG_MAX is only correct for 2's complement numbers.

Passes all the libiberty tests, on both 32-bit and 64-bit hosts.  OK
to apply?

* d-demangle.c: Include limits.h.
(ULONG_MAX): Provide fall-back definition.
(dlang_number): Simplify and correct overflow test.  Only
write *ret on returning non-NULL.
(dlang_decode_backref): Likewise.

diff --git a/libiberty/d-demangle.c b/libiberty/d-demangle.c
index f2d6946eca..59e6ae007a 100644
--- a/libiberty/d-demangle.c
+++ b/libiberty/d-demangle.c
@@ -31,6 +31,9 @@ If not, see .  */
 #ifdef HAVE_CONFIG_H
 #include "config.h"
 #endif
+#ifdef HAVE_LIMITS_H
+#include <limits.h>
+#endif
 
 #include "safe-ctype.h"
 
@@ -45,6 +48,10 @@ If not, see .  */
 #include 
 #include "libiberty.h"
 
+#ifndef ULONG_MAX
+#define ULONG_MAX   (~0UL)
+#endif
+
 /* A mini string-handling package */
 
 typedef struct string  /* Beware: these aren't required to be */
@@ -207,24 +214,24 @@ dlang_number (const char *mangled, long *ret)
   if (mangled == NULL || !ISDIGIT (*mangled))
 return NULL;
 
-  (*ret) = 0;
+  unsigned long val = 0;
 
   while (ISDIGIT (*mangled))
 {
-  (*ret) *= 10;
-
-  /* If an overflow occured when multiplying by ten, the result
-will not be a multiple of ten.  */
-  if ((*ret % 10) != 0)
+  /* Check for overflow.  Yes, we return NULL here for some digits
+that don't overflow "val * 10 + digit", but that doesn't
+matter given the later "(long) val < 0" test.  */
+  if (val > (ULONG_MAX - 9) / 10)
return NULL;
 
-  (*ret) += mangled[0] - '0';
+  val = val * 10 + mangled[0] - '0';
   mangled++;
 }
 
-  if (*mangled == '\0' || *ret < 0)
+  if (*mangled == '\0' || (long) val < 0)
 return NULL;
 
+  *ret = val;
   return mangled;
 }
 
@@ -294,24 +301,24 @@ dlang_decode_backref (const char *mangled, long *ret)
[A-Z] NumberBackRef
^
*/
-  (*ret) = 0;
+  unsigned long val = 0;
 
   while (ISALPHA (*mangled))
 {
-  (*ret) *= 26;
+  /* Check for overflow.  */
+  if (val > (ULONG_MAX - 25) / 26)
+   break;
 
-  /* If an overflow occured when multiplying by 26, the result
-will not be a multiple of 26.  */
-  if ((*ret % 26) != 0)
-   return NULL;
+  val *= 26;
 
   if (mangled[0] >= 'a' && mangled[0] <= 'z')
{
- (*ret) += mangled[0] - 'a';
+ val += mangled[0] - 'a';
+ *ret = val;
  return mangled + 1;
}
 
-  (*ret) += mangled[0] - 'A';
+  val += mangled[0] - 'A';
   mangled++;
 }
 
-- 
Alan Modra
Australia Development Lab, IBM


RFC: --enable-link-serialization support

2020-09-03 Thread Jakub Jelinek via Gcc-patches
Hi!

As discussed earlier on IRC, --enable-link-mutex has a big problem that it
decreases number of available jobs by the number of link commands waiting
for the lock, so e.g. when doing make -j32 build with 11 different
big programs linked with $(LLINKER) we end up with just 22 effective jobs,
and with e.g. make -j8 with those 11 different big programs we actually most
likely serialize everything during linking onto a single job.

The following patch is an attempt to handle it using Make dependencies
instead, let configure create a chain, e.g.
c++.prev: c.serial
fortran.prev: c++.serial
lto.prev: fortran.serial
in Make-hooks, which says that cc1plus can be only linked after cc1 is
linked, then f951, then lto1 and then lto-dump.

One thing I don't like very much on this implementation is that it uses
recursive make, something we apparently don't use in gcc/ right now.

One reason for that is that I wanted to make the case when I do:
make -j32 cc1plus
by hand to avoid uselessly building cc1, gnat1 etc. (whatever is in
the CONFIG_LANGUAGES before), i.e. only serialize if done from a bootstrap.

But perhaps there could be a way out of that, by moving the
--enable-link-serialization to toplevel configure and arrange for it to just
pass some variable to the gcc make in that case.  Then the Make-hooks could
look like:
ifeq (DO_LINK_SERIALIZATION,true)
ada.prev: c.serial
brig.prev: ada.serial
c++.prev: brig.serial
d.prev: c++.serial
fortran.prev: d.serial
go.prev: fortran.serial
lto.prev: go.serial
objc.prev: lto.serial
obj-c++.prev: objc.serial
else
ada.prev:
brig.prev:
c++.prev:
d.prev:
fortran.prev:
go.prev:
lto.prev:
objc.prev:
obj-c++.prev:
endif
and then just add c++.prev dependency to cc1obj$(exeext) dependencies
unconditionally.
Thoughts on that?

2020-09-03  Jakub Jelinek  

gcc/
* configure.ac: Add --enable-link-serialization support, add into
Make-hooks rules on which language is previous in the list of
all selected languages.
* Makefile.in (LANGUAGES): Add *.serial instead of * if
--enable-link-serialization is on.
* configure: Regenerated.
gcc/c/
* Make-lang.in (c.serial): Depend on c.
(.PHONY): Add c.serial.
gcc/cp/
* Make-lang.in (.PHONY): Add c++.serial c++.prev.
(CC1PLUS_DEPS): New variable.
(cc1plus$(exeext)): Use it.
(c++.serial): New goal.
gcc/ada/
* Make-lang.in (.PHONY): Add ada.serial ada.prev.
(GNAT1_DEPS): New variable.
(gnat1$(exeext)): Use it.
(ada.serial): New goal.
gcc/brig/
* Make-lang.in (.PHONY): Add brig.serial brig.prev.
(BRIG1_DEPS): New variable.
(brig1$(exeext)): Use it.
(brig.serial): New goal.
gcc/go/
* Make-lang.in (.PHONY): Add go.serial go.prev.
(GO1_DEPS): New variable.
(go1$(exeext)): Use it.
(go.serial): New goal.
gcc/jit/
* Make-lang.in (.PHONY): Add jit.serial jit.prev jit.extra.
(jit.extra): New goal.
(jit): Use it.
(LIBGCCJIT_FILENAME_DEPS): New variable.
($(LIBGCCJIT_FILENAME)): Use it.
(jit.serial): New goal.
gcc/objc/
* Make-lang.in (.PHONY): Add objc.serial objc.prev.
(CC1OBJ_DEPS): New variable.
(cc1obj$(exeext)): Use it.
(objc.serial): New goal.
gcc/objcp/
* Make-lang.in (.PHONY): Add obj-c++.serial obj-c++.prev.
(CC1OBJPLUS_DEPS): New variable.
(cc1objplus$(exeext)): Use it.
(obj-c++.serial): New goal.
gcc/fortran/
* Make-lang.in (.PHONY): Add fortran.serial fortran.prev.
(F951_DEPS): New variable.
(f951$(exeext)): Use it.
(fortran.serial): New goal.
gcc/lto/
* Make-lang.in (.PHONY): Add lto lto.serial lto.prev.
(lto): Depend on $(LTO_EXE) $(LTO_DUMP_EXE).
(lto.all-cross, lto.start.encap): Don't depend on anything.
(LTO_EXE_DEPS): New variable.
($(LTO_EXE)): Use it.
(LTO_DUMP_EXE_DEPS): New variable.
($(LTO_DUMP_EXE)): Use it.
(lto.serial): New goal.

--- gcc/configure.ac.jj 2020-08-26 17:09:45.829254709 +0200
+++ gcc/configure.ac2020-09-03 14:41:02.884746491 +0200
@@ -6380,6 +6380,24 @@ else
 fi
 AC_SUBST(DO_LINK_MUTEX)
 
+dnl Another way of doing that, purely using make dependencies
+
+AC_MSG_CHECKING([whether to serialize linking of multiple front-ends])
+  AC_ARG_ENABLE(link-serialization,
+[AS_HELP_STRING([--enable-link-serialization],
+   [avoid linking multiple front-ends at once to avoid thrashing
+on the build machine through make dependencies])],
+  do_link_serialization=$enableval,
+  do_link_serialization=no)
+AC_MSG_RESULT($do_link_serialization)
+
+if test "$do_link_serialization" = "yes"; then
+   DO_LINK_SERIALIZATION=true
+else
+   DO_LINK_SERIALIZATION=false
+fi
+AC_SUBST(DO_LINK_SERIALIZATION)
+
 # --
 # Language hooks
 # --
@@ -6540,6 +6558,14 @@ do
echo "lang.$t: $x" >> Make

Re: RFC: --enable-link-serialization support

2020-09-03 Thread Richard Biener
On Thu, 3 Sep 2020, Jakub Jelinek wrote:

> Hi!
> 
> As discussed earlier on IRC, --enable-link-mutex has a big problem that it
> decreases number of available jobs by the number of link commands waiting
> for the lock, so e.g. when doing make -j32 build with 11 different
> big programs linked with $(LLINKER) we end up with just 22 effective jobs,
> and with e.g. make -j8 with those 11 different big programs we actually most
> likely serialize everything during linking onto a single job.
> 
> The following patch is an attempt to handle it using Make dependencies
> instead, let configure create a chain, e.g.
> c++.prev: c.serial
> fortran.prev: c++.serial
> lto.prev: fortran.serial
> in Make-hooks, which says that cc1plus can be only linked after cc1 is
> linked, then f951, then lto1 and then lto-dump.
> 
> One thing I don't like very much on this implementation is that it uses
> recursive make, something we apparently don't use in gcc/ right now.
> 
> One reason for that is that I wanted to make the case when I do:
> make -j32 cc1plus
> by hand to avoid uselessly building cc1, gnat1 etc. (whatever is in
> the CONFIG_LANGUAGES before), i.e. only serialize if done from a bootstrap.

But is that an issue in practice?  I usually do not do make -j32 cc1plus
in a tree that was configured for bootstrap, nor do I use
--enable-link-serialization in that case.

Btw, do you plan to keep --enable-link-mutex the way it is?  What it
provides on top of --enable-link-serialization is that progress
metering for very long running LTO link stages.  Not sure if that
works for the last LTO link or only when there's some other link
waiting.

No review on the actual patch - it looks like what I'd have tried
but I'm no make expert ;)

There's a bug about the bad --enable-link-mutex behavior btw (but
you don't fix that, just provide an alternate implementation).

Thanks,
Richard.

> But perhaps there could be a way out of that, by moving the
> --enable-link-serialization to toplevel configure and arrange for it to just
> pass some variable to the gcc make in that case.  Then the Make-hooks could
> look like:
> ifeq (DO_LINK_SERIALIZATION,true)
> ada.prev: c.serial
> brig.prev: ada.serial
> c++.prev: brig.serial
> d.prev: c++.serial
> fortran.prev: d.serial
> go.prev: fortran.serial
> lto.prev: go.serial
> objc.prev: lto.serial
> obj-c++.prev: objc.serial
> else
> ada.prev:
> brig.prev:
> c++.prev:
> d.prev:
> fortran.prev:
> go.prev:
> lto.prev:
> objc.prev:
> obj-c++.prev:
> endif
> and then just add c++.prev dependency to cc1obj$(exeext) dependencies
> unconditionally.
> Thoughts on that?
> 
> 2020-09-03  Jakub Jelinek  
> 
> gcc/
>   * configure.ac: Add --enable-link-serialization support, add into
>   Make-hooks rules on which language is previous in the list of
>   all selected languages.
>   * Makefile.in (LANGUAGES): Add *.serial instead of * if
>   --enable-link-serialization is on.
>   * configure: Regenerated.
> gcc/c/
>   * Make-lang.in (c.serial): Depend on c.
>   (.PHONY): Add c.serial.
> gcc/cp/
>   * Make-lang.in (.PHONY): Add c++.serial c++.prev.
>   (CC1PLUS_DEPS): New variable.
>   (cc1plus$(exeext)): Use it.
>   (c++.serial): New goal.
> gcc/ada/
>   * Make-lang.in (.PHONY): Add ada.serial ada.prev.
>   (GNAT1_DEPS): New variable.
>   (gnat1$(exeext)): Use it.
>   (ada.serial): New goal.
> gcc/brig/
>   * Make-lang.in (.PHONY): Add brig.serial brig.prev.
>   (BRIG1_DEPS): New variable.
>   (brig1$(exeext)): Use it.
>   (brig.serial): New goal.
> gcc/go/
>   * Make-lang.in (.PHONY): Add go.serial go.prev.
>   (GO1_DEPS): New variable.
>   (go1$(exeext)): Use it.
>   (go.serial): New goal.
> gcc/jit/
>   * Make-lang.in (.PHONY): Add jit.serial jit.prev jit.extra.
>   (jit.extra): New goal.
>   (jit): Use it.
>   (LIBGCCJIT_FILENAME_DEPS): New variable.
>   ($(LIBGCCJIT_FILENAME)): Use it.
>   (jit.serial): New goal.
> gcc/objc/
>   * Make-lang.in (.PHONY): Add objc.serial objc.prev.
>   (CC1OBJ_DEPS): New variable.
>   (cc1obj$(exeext)): Use it.
>   (objc.serial): New goal.
> gcc/objcp/
>   * Make-lang.in (.PHONY): Add obj-c++.serial obj-c++.prev.
>   (CC1OBJPLUS_DEPS): New variable.
>   (cc1objplus$(exeext)): Use it.
>   (obj-c++.serial): New goal.
> gcc/fortran/
>   * Make-lang.in (.PHONY): Add fortran.serial fortran.prev.
>   (F951_DEPS): New variable.
>   (f951$(exeext)): Use it.
>   (fortran.serial): New goal.
> gcc/lto/
>   * Make-lang.in (.PHONY): Add lto lto.serial lto.prev.
>   (lto): Depend on $(LTO_EXE) $(LTO_DUMP_EXE).
>   (lto.all-cross, lto.start.encap): Don't depend on anything.
>   (LTO_EXE_DEPS): New variable.
>   ($(LTO_EXE)): Use it.
>   (LTO_DUMP_EXE_DEPS): New variable.
>   ($(LTO_DUMP_EXE)): Use it.
>   (lto.serial): New goal.
> 
> --- gcc/configure.ac.jj   2020-08-26 17:09:45.829254709 +0200

Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-03 Thread Qing Zhao via Gcc-patches
Hi,

Per request, I collected runtime performance data and code size data with 
CPU2017 on an X86 platform. 

*** Machine info:
model name : Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
$ lscpu | grep NUMA
NUMA node(s):  2
NUMA node0 CPU(s): 0-21,44-65
NUMA node1 CPU(s): 22-43,66-87

***CPU2017 benchmarks: 
all the benchmarks with C/C++, 9 Integer benchmarks, 10 FP benchmarks. 

***Configures:
Intrate and fprate, 22 copies. 

***Compiler options:
no :    -g -O2 -march=native
used_gpr_arg:   no + -fzero-call-used-regs=used-gpr-arg
used_arg:   no + -fzero-call-used-regs=used-arg
all_arg:no + -fzero-call-used-regs=all-arg
used_gpr:   no + -fzero-call-used-regs=used-gpr
all_gpr:no + -fzero-call-used-regs=all-gpr
used:   no + -fzero-call-used-regs=used
all:no + -fzero-call-used-regs=all

***each benchmark runs 3 times. 

***runtime performance data:
Please see the attached csv file


From the data, we can see that:
On average, all the options starting with “used_…”  (i.e, only the registers 
that are used in the routine will be zeroed) have very low runtime overheads, 
at most 1.72% for integer benchmarks, and 1.17% for FP benchmarks. 
If all the registers will be zeroed, the runtime overhead is bigger, all_arg is 
5.7%, all_gpr is 3.5%, and all is 17.56% for integer benchmarks on average. 
Looks like the overhead of zeroing vector registers is much bigger. 

For ROP mitigation, -fzero-call-used-regs=used-gpr-arg should be enough, the 
runtime overhead with this is very small.

***code size increase data:

Please see the attached file 


From the data, we can see that:
The code size impact in general is very small, the biggest is “all_arg”, which 
is 1.06% for integer benchmark, and 1.13% for FP benchmarks.

So, from the data collected, I think that the run-time overhead and code size 
increase from this option are very reasonable. 

Let me know your comments and opinions.

thanks.

Qing

> On Aug 25, 2020, at 4:54 PM, Qing Zhao via Gcc-patches 
>  wrote:
> 
> 
> 
>> On Aug 24, 2020, at 3:20 PM, Segher Boessenkool  
>> wrote:
>> 
>> Hi!
>> 
>> On Mon, Aug 24, 2020 at 01:02:03PM -0500, Qing Zhao wrote:
 On Aug 24, 2020, at 12:49 PM, Segher Boessenkool 
  wrote:
 On Wed, Aug 19, 2020 at 06:27:45PM -0500, Qing Zhao wrote:
>> On Aug 19, 2020, at 5:57 PM, Segher Boessenkool 
>>  wrote:
>> Numbers on how expensive this is (for what arch, in code size and in
>> execution time) would be useful.  If it is so expensive that no one will
>> use it, it helps security at most none at all :-(
 
 Without numbers on this, no one can determine if it is a good tradeoff
 for them.  And we (the GCC people) cannot know if it will be useful for
 enough users that it will be worth the effort for us.  Which is why I
 keep hammering on this point.
>>> I can collect some run-time overhead data on this, do you have a 
>>> recommendation on what test suite I can use
>>> For this testing? (Is CPU2017 good enough)?
>> 
>> I would use something more real-life, not 12 small pieces of code.
> 
> There is some basic information about the benchmarks of CPU2017 in below link:
> 
> https://urldefense.com/v3/__https://www.spec.org/cpu2017/Docs/overview.html*suites__;Iw!!GqivPVa7Brio!PmRE_sLg10gVnn1UZLs1q1TPoTV0SCnw0Foo5QQlZgD03MeL0KIyPVXl0XlvVVRP$
> 
> GCC itself is one of the benchmarks in CPU2017 (502.gcc_r). And 526.blender_r 
> is even larger than 502.gcc_r. 
> And there are several other quite big benchmarks as well (perlbench, 
> xalancbmk, parest, imagick, etc).
> 
> thanks.
> 
> Qing



[PATCH] --enable-link-serialization support

2020-09-03 Thread Jakub Jelinek via Gcc-patches
On Thu, Sep 03, 2020 at 03:53:35PM +0200, Richard Biener wrote:
> On Thu, 3 Sep 2020, Jakub Jelinek wrote:
> But is that an issue in practice?  I usually do not do make -j32 cc1plus
> in a tree that was configured for bootstrap, nor do I use
> --enable-link-serialization in that case.

Guess most often true, but one could still do it when debugging some
particular problem during bootstrap.

> Btw, do you plan to keep --enable-link-mutex the way it is?  What it
> provides ontop of --enable-link-serialization is that progress
> metering for very long running LTO link stages.  Not sure if that
> works for the last LTO link or only when there's some other link
> waiting.

I kept it as is for now.  Primarily I didn't want to use
--enable-link-mutex for the new thing when it has nothing to do with a
mutex.  I think with both options together it will just print a useless
message that it acquired the lock immediately in each case.
And I'm not really sure how and what we should print as a progress indicator;
perhaps Make-hooks could record in some variable the number of selected
languages (well, one can use $(words $(CONFIG_LANGUAGES)) for that purpose)
and in other vars record for each FE its serial number, and then add
@echo Linking C++ - $(c++.idx) out of $(words $(CONFIG_LANGUAGES))
right before the $(LLINK) lines, perhaps only added if $(DO_LINK_SERIALIZATION)
is true.  Or perhaps do some ANSI colors progress bar for that too.

Anyway, here is the updated patch that doesn't use recursive make for that,
I think that could run into latent Makefile dependency issues and the like.

For new FEs, the new requirements are basically that one should add
$lang.serial $lang.prev to .PHONY, make $lang.serial depend on the binary
linked with $(LLINKER) and add $lang.prev to the dependencies of that
binary.  Unless, as in the lto special case, there are multiple such
binaries, then $lang.serial should depend on the last one, $lang.prev
should be the dependency of the first one and there should be dependencies
between the binaries guarded on $(DO_LINK_SERIALIZATION) equal to true.
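In Make-lang.in terms that boils down to something like the following sketch
for a hypothetical front end "foo" (the names foo, foo1$(exeext) and
$(FOO1_OBJS) are placeholders of mine, not taken from the patch):

foo.serial: foo1$(exeext)

foo1$(exeext): $(FOO1_OBJS) $(BACKEND) $(LIBDEPS) foo.prev
	+$(LLINKER) $(ALL_LINKERFLAGS) $(LDFLAGS) -o $@ \
		$(FOO1_OBJS) $(BACKEND) $(LIBS) $(BACKENDLIBS)

# configure appends either "foo.prev: <previous-language>.serial" or an
# empty "foo.prev:" rule to Make-hooks, depending on
# --enable-link-serialization.

.PHONY: foo.serial foo.prev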

2020-09-03  Jakub Jelinek  

* configure.ac (--enable-link-serialization): New configure option.
* Makefile.tpl (EXTRA_GCC_FLAGS): Pass DO_LINK_SERIALIZATION=true
or false to recursive make in gcc subdirectory.
* configure: Regenerated.
* Makefile.in: Regenerated.
gcc/
* configure.ac: Add $lang.prev: $prev.serial or $lang.prev: rules to
Make-hooks.
* doc/install.texi (--enable-link-serialization): Document.
* configure: Regenerated.
gcc/c/
* Make-lang.in (c.serial): New goal.
(.PHONY): Add c.serial c.prev.
gcc/cp/
* Make-lang.in (c++.serial): New goal.
(.PHONY): Add c++.serial c++.prev.
(cc1plus$(exeext)): Depend on c++.prev.
gcc/fortran/
* Make-lang.in (fortran.serial): New goal.
(.PHONY): Add fortran.serial fortran.prev.
(f951$(exeext)): Depend on fortran.prev.
gcc/lto/
* Make-lang.in (lto, lto.serial): New goals.
(.PHONY): Add lto.serial lto.prev.
(lto.all.cross, lto.start.encap): Remove dependencies.
($(LTO_EXE)): Depend on lto.prev.
(LTO_DUMP_EXE_PREV): New variable.
($(LTO_DUMP_EXE)): Depend on $(LTO_DUMP_EXE_PREV).
gcc/objc/
* Make-lang.in (objc.serial): New goal.
(.PHONY): Add objc.serial objc.prev.
(cc1obj$(exeext)): Depend on objc.prev.
gcc/objcp/
* Make-lang.in (obj-c++.serial): New goal.
(.PHONY): Add obj-c++.serial obj-c++.prev.
(cc1objplus$(exeext)): Depend on obj-c++.prev.
gcc/ada/
* gcc-interface/Make-lang.in (ada.serial): New goal.
(.PHONY): Add ada.serial ada.prev.
(gnat1$(exeext)): Depend on ada.prev.
gcc/brig/
* Make-lang.in (brig.serial): New goal.
(.PHONY): Add brig.serial brig.prev.
(brig1$(exeext)): Depend on brig.prev.
gcc/go/
* Make-lang.in (go.serial): New goal.
(.PHONY): Add go.serial go.prev.
(go1$(exeext)): Depend on go.prev.
gcc/jit/
* Make-lang.in (jit.serial): New goal.
(.PHONY): Add jit.serial jit.prev.
($(LIBGCCJIT_FILENAME)): Depend on jit.prev.
gcc/d/
* Make-lang.in (d.serial): New goal.
(.PHONY): Add d.serial d.prev.
(d21$(exeext)): Depend on d.prev.

--- configure.ac.jj 2020-08-24 10:00:01.248259486 +0200
+++ configure.ac2020-09-03 16:09:51.010782912 +0200
@@ -1866,6 +1866,24 @@ AC_ARG_ENABLE(linker-plugin-flags,
   extra_linker_plugin_flags=)
 AC_SUBST(extra_linker_plugin_flags)
 
+dnl Whether to prevent multiple GCC front-ends from linking at the same time
+
+AC_MSG_CHECKING([whether to serialize linking of multiple front-ends])
+  AC_ARG_ENABLE(link-serialization,
+[AS_HELP_STRING([--enable-link-serialization],
+   [avoid linking multiple GCC front-ends at once using make
+dependencies to avoid thrashing on the buil

Re: [PATCH] c++: Disable -frounding-math during manifestly constant evaluation [PR96862]

2020-09-03 Thread Jakub Jelinek via Gcc-patches
On Wed, Sep 02, 2020 at 04:55:14PM -0400, Jason Merrill via Gcc-patches wrote:
> C++ says that division by zero has undefined behavior, and that an
> expression with undefined behavior is not constant, so we shouldn't fold
> 1./0 to inf anyway.  The same is true of other trapping operations.  So
> clearing flag_signaling_nans, flag_trapping_math, and flag_trapv seems wrong
> for C++.  And folding_initializer seems to be used for the same sort of
> thing.

I indeed see:
"If during the evaluation of an expression, the result is not mathematically 
defined or not in the range of
representable values for its type, the behavior is undefined. [Note: Treatment 
of division by zero, forming a
remainder using a zero divisor, and all floating-point exceptions vary among 
machines, and is sometimes
adjustable by a library function. — end note]"

Now, for floating point exceptions, we have several of them.
I'd hope the inexact one shouldn't count, because otherwise we can't
evaluate at compile time pretty much nothing (but we actually seem to ignore
inexact during fold const except for rounding math).
The other exceptions are invalid, which is for the not mathematically
defined cases (0 / 0, inf - inf and the like), yes, they can be expressed as
NaNs, but I guess the above make it undefined.
Then there are overflows into infinities, should that count as not
representable?
Underflows are more like inexact I'd say.
So perhaps we should have a special flag that fold-const.c checks next to
flag_signalling_math and decide what is and is not undefined.
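To put the three buckets into code form (generic examples of mine, not taken
from the patch):

  constexpr double a = 0.0 / 0.0;       // invalid operation: not mathematically defined
  constexpr double b = 1.0e308 * 10.0;  // overflow: result not representable, would be +inf
  constexpr double c = 1.0 / 3.0;       // merely inexact (and rounding-mode dependent)

The question is essentially which of these a manifestly constant-evaluated
context has to reject and which it may quietly fold.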

Jakub



Re: [PATCH] Enable GCC support for AMX

2020-09-03 Thread Kirill Yukhin via Gcc-patches
Hello,

On 06 июл 09:58, Hongyu Wang via Gcc-patches wrote:
> Hi:
> 
> This patch is about to support Intel Advanced Matrix Extensions (AMX)
> which will be enabled in GLC.
> 
> AMX is a new 64-bit programming paradigm consisting of two
> components: a set of 2-dimensional registers (tiles) representing
> sub-arrays from a larger 2-dimensional memory image,
> and an accelerator able to operate on tiles
> 
> Supported instructions are
> 
> AMX-TILE:ldtilecfg/sttilecfg/tileloadd/tileloaddt1/tilezero/tilerelease
> AMX-INT8:tdpbssd/tdpbsud/tdpbusd/tdpbuud
> AMX-BF16:tdpbf16ps
> 
> The intrinsics adopts constant tile register number as its input parameters.
> 
> For detailed information, please refer to
> https://software.intel.com/content/dam/develop/public/us/en/documents/architecture-instruction-set-extensions-programming-reference.pdf
> 
> Bootstrap ok, regression test on i386/x86 backend is ok.
> 
> OK for master?

I was trying to apply your patch to recent master and got
compilation error:

g++ -std=gnu++11  -fno-PIE -c   -g -O2 -DIN_GCC -fno-exceptions -fno-rtti 
-fasynchronous-unwind-tables -W -Wall -Wno-narrowi
ng -Wwrite-strings -Wcast-qual -Wmissing-format-attribute -Woverloaded-virtual 
-pedantic -Wno-long-long -Wno-variadic-macros -Wn
o-overlength-strings -fno-common  -DHAVE_CONFIG_H -I. -I. 
-I/export/kyukhin/gcc/src/gcc -I/export/kyukhin/gcc/src/gcc/. -I/expor
t/kyukhin/gcc/src/gcc/../include 
-I/export/kyukhin/gcc/src/gcc/../libcpp/include  
-I/export/kyukhin/gcc/src/gcc/../libdecnumber 
-I/export/kyukhin/gcc/src/gcc/../libdecnumber/bid -I../libdecnumber 
-I/export/kyukhin/gcc/src/gcc/../libbacktrace   -o i386-opti
ons.o -MT i386-options.o -MMD -MP -MF ./.deps/i386-options.TPo 
/export/kyukhin/gcc/src/gcc/config/i386/i386-options.c
/export/kyukhin/gcc/src/gcc/config/i386/i386-options.c: In function ‘bool 
ix86_option_override_internal(bool, gcc_options*, gcc_
options*)’:
/export/kyukhin/gcc/src/gcc/config/i386/i386-options.c:2263:41: error: 
‘PTA_AMX_TILE’ was not declared in this scope
  if (((processor_alias_table[i].flags & PTA_AMX_TILE) != 0)
 ^
/export/kyukhin/gcc/src/gcc/config/i386/i386-options.c:2267:41: error: 
‘PTA_AMX_INT8’ was not declared in this scope
  if (((processor_alias_table[i].flags & PTA_AMX_INT8) != 0)
 ^
/export/kyukhin/gcc/src/gcc/config/i386/i386-options.c:2271:41: error: 
‘PTA_AMX_BF16’ was not declared in this scope
  if (((processor_alias_table[i].flags & PTA_AMX_BF16) != 0)

Could you please fix that?


--
K

PS: Please excuse me for late response.


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-03 Thread Qing Zhao via Gcc-patches


Hi,

Looks like both attached .csv files were deleted during the email delivery 
procedure. Not sure what’s the reason for this.

Then I have to copy the text file here for you reference:

benchmarks:
C   500.perlbench_r  
C   502.gcc_r 
C   505.mcf_r   
C++ 520.omnetpp_r
C++ 523.xalancbmk_r  
C   525.x264_r
C++ 531.deepsjeng_r
C++ 541.leela_r
C   557.xz_r   
  

C++/C/Fortran   507.cactuBSSN_r  
C++ 508.namd_r
C++ 510.parest_r 
C++/C   511.povray_r   
C   519.lbm_r 
Fortran/C   521.wrf_r 
C++/C   526.blender_r   
Fortran/C   527.cam4_r  
C   538.imagick_r  
C   544.nab_r

***runtime overhead data and code size overhead data, I converted them to PDF 
files; hopefully this time I can attach them to the email:

thanks.

Qing






> On Sep 3, 2020, at 9:29 AM, Qing Zhao via Gcc-patches 
>  wrote:
> 
> Hi,
> 
> Per request, I collected runtime performance data and code size data with 
> CPU2017 on an X86 platform. 
> 
> *** Machine info:
> model name : Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> $ lscpu | grep NUMA
> NUMA node(s):  2
> NUMA node0 CPU(s): 0-21,44-65
> NUMA node1 CPU(s): 22-43,66-87
> 
> ***CPU2017 benchmarks: 
> all the benchmarks with C/C++, 9 Integer benchmarks, 10 FP benchmarks. 
> 
> ***Configures:
> Intrate and fprate, 22 copies. 
> 
> ***Compiler options:
> no :  -g -O2 -march=native
> used_gpr_arg: no + -fzero-call-used-regs=used-gpr-arg
> used_arg: no + -fzero-call-used-regs=used-arg
> all_arg:  no + -fzero-call-used-regs=all-arg
> used_gpr: no + -fzero-call-used-regs=used-gpr
> all_gpr:  no + -fzero-call-used-regs=all-gpr
> used: no + -fzero-call-used-regs=used
> all:  no + -fzero-call-used-regs=all
> 
> ***each benchmark runs 3 times. 
> 
> ***runtime performance data:
> Please see the attached csv file
> 
> 
> From the data, we can see that:
> On average, all the options starting with “used_…”  (i.e, only the registers 
> that are used in the routine will be zeroed) have very low runtime overheads, 
> at most 1.72% for integer benchmarks, and 1.17% for FP benchmarks. 
> If all the registers will be zeroed, the runtime overhead is bigger, all_arg 
> is 5.7%, all_gpr is 3.5%, and all is 17.56% for integer benchmarks on 
> average. 
> Looks like the overhead of zeroing vector registers is much bigger. 
> 
> For ROP mitigation, -fzero-call-used-regs=used-gpr-arg should be enough, the 
> runtime overhead with this is very small.
> 
> ***code size increase data:
> 
> Please see the attached file 
> 
> 
> From the data, we can see that:
> The code size impact in general is very small, the biggest is “all_arg”, 
> which is 1.06% for integer benchmark, and 1.13% for FP benchmarks.
> 
> So, from the data collected, I think that the run-time overhead and code size 
> increase from this option are very reasonable. 
> 
> Let me know your comments and opinions.
> 
> thanks.
> 
> Qing
> 
>> On Aug 25, 2020, at 4:54 PM, Qing Zhao via Gcc-patches 
>>  wrote:
>> 
>> 
>> 
>>> On Aug 24, 2020, at 3:20 PM, Segher Boessenkool 
>>>  wrote:
>>> 
>>> Hi!
>>> 
>>> On Mon, Aug 24, 2020 at 01:02:03PM -0500, Qing Zhao wrote:
> On Aug 24, 2020, at 12:49 PM, Segher Boessenkool 
>  wrote:
> On Wed, Aug 19, 2020 at 06:27:45PM -0500, Qing Zhao wrote:
>>> On Aug 19, 2020, at 5:57 PM, Segher Boessenkool 
>>>  wrote:
>>> Numbers on how expensive this is (for what arch, in code size and in
>>> execution time) would be useful.  If it is so expensive that no one will
>>> use it, it helps security at most none at all :-(
> 
> Without numbers on this, no one can determine if it is a good tradeoff
> for them.  And we (the GCC people) cannot know if it will be useful for
> enough users that it will be worth the effort for us.  Which is why I
> keep hammering on this point.
 I can collect some run-time overhead data on this, do you have a 
 recommendation on what test suite I can use
 For this testing? (Is CPU2017 good enough)?
>>> 
>>> I would use something more real-life, not 12 small pieces of code.
>> 
>> There is some basic information about the benchmarks of CPU2017 in below 
>> link:
>> 
>> https://urldefense.com/v3/__https://www.spec.org/cpu2017/Docs/overview.html*suites__;Iw!!GqivPVa7Brio!PmRE_sLg10gVnn1UZLs1q1TPoTV0SCnw0Foo5QQlZgD03MeL0KIyPVXl0XlvVVRP$
>>  
>> 

Re: [PATCH] Enable GCC support for AMX

2020-09-03 Thread H.J. Lu via Gcc-patches
On Thu, Sep 3, 2020 at 8:08 AM Kirill Yukhin via Gcc-patches
 wrote:
>
> Hello,
>
> On 06 июл 09:58, Hongyu Wang via Gcc-patches wrote:
> > Hi:
> >
> > This patch is about to support Intel Advanced Matrix Extensions (AMX)
> > which will be enabled in GLC.
> >
> > AMX is a new 64-bit programming paradigm consisting of two
> > components: a set of 2-dimensional registers (tiles) representing
> > sub-arrays from a larger 2-dimensional memory image,
> > and an accelerator able to operate on tiles
> >
> > Supported instructions are
> >
> > AMX-TILE:ldtilecfg/sttilecfg/tileloadd/tileloaddt1/tilezero/tilerelease
> > AMX-INT8:tdpbssd/tdpbsud/tdpbusd/tdpbuud
> > AMX-BF16:tdpbf16ps
> >
> > The intrinsics adopts constant tile register number as its input parameters.
> >
> > For detailed information, please refer to
> > https://software.intel.com/content/dam/develop/public/us/en/documents/architecture-instruction-set-extensions-programming-reference.pdf
> >
> > Bootstrap ok, regression test on i386/x86 backend is ok.
> >
> > OK for master?
>
> I was trying to apply your patch to recent master and got
> compilation error:
>
> g++ -std=gnu++11  -fno-PIE -c   -g -O2 -DIN_GCC -fno-exceptions -fno-rtti 
> -fasynchronous-unwind-tables -W -Wall -Wno-narrowi
> ng -Wwrite-strings -Wcast-qual -Wmissing-format-attribute 
> -Woverloaded-virtual -pedantic -Wno-long-long -Wno-variadic-macros -Wn
> o-overlength-strings -fno-common  -DHAVE_CONFIG_H -I. -I. 
> -I/export/kyukhin/gcc/src/gcc -I/export/kyukhin/gcc/src/gcc/. -I/expor
> t/kyukhin/gcc/src/gcc/../include 
> -I/export/kyukhin/gcc/src/gcc/../libcpp/include  
> -I/export/kyukhin/gcc/src/gcc/../libdecnumber
> -I/export/kyukhin/gcc/src/gcc/../libdecnumber/bid -I../libdecnumber 
> -I/export/kyukhin/gcc/src/gcc/../libbacktrace   -o i386-opti
> ons.o -MT i386-options.o -MMD -MP -MF ./.deps/i386-options.TPo 
> /export/kyukhin/gcc/src/gcc/config/i386/i386-options.c
> /export/kyukhin/gcc/src/gcc/config/i386/i386-options.c: In function ‘bool 
> ix86_option_override_internal(bool, gcc_options*, gcc_
> options*)’:
> /export/kyukhin/gcc/src/gcc/config/i386/i386-options.c:2263:41: error: 
> ‘PTA_AMX_TILE’ was not declared in this scope
>   if (((processor_alias_table[i].flags & PTA_AMX_TILE) != 0)
>  ^
> /export/kyukhin/gcc/src/gcc/config/i386/i386-options.c:2267:41: error: 
> ‘PTA_AMX_INT8’ was not declared in this scope
>   if (((processor_alias_table[i].flags & PTA_AMX_INT8) != 0)
>  ^
> /export/kyukhin/gcc/src/gcc/config/i386/i386-options.c:2271:41: error: 
> ‘PTA_AMX_BF16’ was not declared in this scope
>   if (((processor_alias_table[i].flags & PTA_AMX_BF16) != 0)
>
> Could you please fix that?

Here is the rebased patch against

commit 3c219134152f645103f2fcd50735b177ccd76cde
Author: Jonathan Wakely 
Date:   Thu Sep 3 12:38:50 2020 +0100

libstdc++: Optimise GCD algorithms

Thanks.

-- 
H.J.
From 713cafb77a16331620af3eb2c2384a7c388ecd90 Mon Sep 17 00:00:00 2001
From: liuhongt 
Date: Thu, 25 Jul 2019 16:49:36 +0800
Subject: [PATCH] Enable GCC support for AMX-TILE,AMX-INT8,AMX-BF16.

AMX-TILE:ldtilecfg/sttilecfg/tileloadd/tileloaddt1/tilezero/tilerelease
AMX-INT8:tdpbssd/tdpbsud/tdpbusd/tdpbuud
AMX-BF16:tdpbf16ps

gcc/ChangeLog

	* common/config/i386/i386-common.c (OPTION_MASK_ISA2_AMX_TILE_SET,
	OPTION_MASK_ISA2_AMX_INT8_SET, OPTION_MASK_ISA2_AMX_BF16_SET,
	OPTION_MASK_ISA2_AMX_TILE_UNSET,
	OPTION_MASK_ISA2_AMX_INT8_UNSET, OPTION_MASK_ISA2_AMX_BF16_UNSET):
	New macros.
	(ix86_handle_option): Handle -mamx-tile, -mamx-int8, -mamx-bf16.
	* common/config/i386/i386-cpuinfo.h (processor_types): Add
	FEATURE_AMX_TILE, FEATURE_AMX_INT8, FEATURE_AMX_BF16.
	* common/config/i386/cpuinfo.h (XSTATE_TILECFG,
	XSTATE_TILEDATA, XCR_AMX_ENABLED_MASK): New macro.
	(get_available_features): Enable AMX features only if
	their states are supported by OSXSAVE.
	* common/config/i386/i386-isas.h: Add ISA_NAME_TABLE_ENTRY
	for amx-tile, amx-int8, amx-bf16.
	* config.gcc: Add amxtileintrin.h, amxint8intrin.h,
	amxbf16intrin.h to extra headers.
	* config/i386/amxbf16intrin.h: New file.
	* config/i386/amxint8intrin.h: Ditto.
	* config/i386/amxtileintrin.h: Ditto.
	* config/i386/cpuid.h (bit_AMX_BF16, bit_AMX_TILE, bit_AMX_INT8):
	New macro.
	* config/i386/i386-c.c (ix86_target_macros_internal): Define
	__AMX_TILE__, __AMX_INT8__, AMX_BF16__.
	* config/i386/i386-options.c (ix86_target_string): Add
	-mamx-tile, -mamx-int8, -mamx-bf16.
	(ix86_option_override_internal): Handle AMX-TILE,
	AMX-INT8, AMX-BF16.
	* config/i386/i386.h (TARGET_AMX_TILE, TARGET_AMX_TILE_P,
	TARGET_AMX_INT8, TARGET_AMX_INT8_P, TARGET_AMX_BF16_P,
	PTA_AMX_TILE, PTA_AMX_INT8, PTA_AMX_BF16): New macros.
	* config/i386/i386.opt: Add -mamx-tile, -mamx-int8, -mamx-bf16.
	* config/i386/immintrin.h: Include amxtileintrin.h,
	amxint8intrin.h, amxbf16intrin.h.
	* doc/invoke.texi: Document -mamx-tile, -mamx-int8, -mamx-bf16.
	* doc/extend.texi: Document amx-tile, a

Re: [PATCH] dwarf: Multi-register CFI address support

2020-09-03 Thread Andrew Stubbs

On 28/08/2020 13:04, Andrew Stubbs wrote:

Hi all,

This patch introduces DWARF CFI support for architectures that require 
multiple registers to hold pointers, such as the stack pointer, frame 
pointer, and return address. The motivating case is the AMD GCN 
architecture which has 64-bit address pointers, but 32-bit registers.


The current implementation permits program variables to span as many 
registers as they need, but assumes that CFI expressions will only need 
a single register for each frame value.


To be fair, the DWARF standard makes a similar assumption; the engineers 
working on LLVM and GDB, at AMD, have therefore invented some new DWARF 
operators that they plan to propose for a future standard. Only one is 
relevant here, however: DW_OP_LLVM_piece_end. (Unfortunately this 
clashes with an AArch64 extension, but I think we can cope using an 
alias -- only GCC dumps will be confusing.)


My approach is to change the type representing a DWARF register 
throughout the CFI code. This permits the register span information to 
propagate to where it is needed.


I've taken advantage of C++ struct copies and operator== to minimize the 
amount of refactoring required. I'm not sure this meets the GCC 
guidelines exactly, but if not I can change that once the basic form is 
agreed. (I also considered an operator= to make assigning single dwreg 
values transparent, but that hid too many invalid assumptions.)
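For readers without dwarf2out.h in front of them, the new type is roughly of
the following shape.  This is a guess reconstructed from the ChangeLog and the
uses visible in the diff below, not a quotation of the patch:

/* Hypothetical sketch of the new struct cfa_reg (dwarf2out.h).  */
struct cfa_reg
{
  unsigned int reg;            /* DWARF register number.  */
  unsigned short span;         /* Number of consecutive registers used.  */
  poly_uint16_pod span_width;  /* Size of each piece, in bytes.  */

  /* Describe a location held entirely in one DWARF register.  */
  void set_by_dwreg (unsigned int r)
  {
    reg = r;
    span = 1;
    span_width = 0;
  }

  bool operator== (const cfa_reg &other) const
  {
    return (reg == other.reg && span == other.span
	    && known_eq (span_width, other.span_width));
  }
};

A plain aggregate like this is why struct copies and operator== are enough to
thread the span information through the existing CFI code.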


OK to commit? (Although, I'll hold off until AMD release the compatible 
GDB.)


Minor patch update, following Tom's feedback.

Andrew
dwarf: Multi-register CFI address support

Add support for architectures such as AMD GCN, in which the pointer size is
larger than the register size.  This allows the CFI information to include
multi-register locations for the stack pointer, frame pointer, and return
address.

Note that this uses a newly proposed DWARF operator DW_OP_LLVM_piece_end,
which is currently only recognized by the ROCGDB debugger from AMD.  The exact
name and encoding for this operator is subject to change if and when the DWARF
standard accepts it.

gcc/ChangeLog:

	* dwarf2cfi.c (dw_stack_pointer_regnum): Change type to struct cfa_reg.
	(dw_frame_pointer_regnum): Likewise.
	(new_cfi_row): Use set_by_dwreg.
	(get_cfa_from_loc_descr): Use set_by_dwreg.  Support register spans
	with DW_OP_piece and DW_OP_LLVM_piece_end.  Support DW_OP_lit*,
	DW_OP_const*, DW_OP_minus, and DW_OP_plus.
	(lookup_cfa_1): Use set_by_dwreg.
	(def_cfa_0): Update for cfa_reg and support register spans.
	(reg_save): Change sreg parameter to struct cfa_reg.  Support register
	spans.
	(dwf_cfa_reg): New function.
	(dwarf2out_flush_queued_reg_saves): Use dwf_cfa_reg instead of
	dwf_regno.
	(dwarf2out_frame_debug_def_cfa): Likewise.
	(dwarf2out_frame_debug_adjust_cfa): Likewise.
	(dwarf2out_frame_debug_cfa_offset): Likewise.  Update reg_save usage.
	(dwarf2out_frame_debug_cfa_register): Likewise.
	(dwarf2out_frame_debug_expr): Likewise.
	(create_pseudo_cfg): Use set_by_dwreg.
	(initial_return_save): Use set_by_dwreg and dwf_cfa_reg,
	(create_cie_data): Use dwf_cfa_reg.
	(execute_dwarf2_frame): Use dwf_cfa_reg.
	(dump_cfi_row): Use set_by_dwreg.
	* dwarf2out.c (build_span_loc): New function.
	(build_cfa_loc): Support register spans.
	(build_cfa_aligned_loc): Update cfa_reg usage.
	(convert_cfa_to_fb_loc_list): Use set_by_dwreg.
	* dwarf2out.h (struct cfa_reg): New type.
	(struct dw_cfa_location): Use struct cfa_reg.
	(build_span_loc): New prototype.
	* gengtype.c (main): Accept poly_uint16_pod type.

include/ChangeLog:

	* dwarf2.def (DW_OP_LLVM_piece_end): New extension operator.

diff --git a/gcc/dwarf2cfi.c b/gcc/dwarf2cfi.c
index 0d179b388e4..63cb6de5c4f 100644
--- a/gcc/dwarf2cfi.c
+++ b/gcc/dwarf2cfi.c
@@ -229,8 +229,8 @@ static vec queued_reg_saves;
 static bool any_cfis_emitted;
 
 /* Short-hand for commonly used register numbers.  */
-static unsigned dw_stack_pointer_regnum;
-static unsigned dw_frame_pointer_regnum;
+static struct cfa_reg dw_stack_pointer_regnum;
+static struct cfa_reg dw_frame_pointer_regnum;
 
 /* Hook used by __throw.  */
 
@@ -430,7 +430,7 @@ new_cfi_row (void)
 {
   dw_cfi_row *row = ggc_cleared_alloc ();
 
-  row->cfa.reg = INVALID_REGNUM;
+  row->cfa.reg.set_by_dwreg (INVALID_REGNUM);
 
   return row;
 }
@@ -538,7 +538,11 @@ get_cfa_from_loc_descr (dw_cfa_location *cfa, struct dw_loc_descr_node *loc)
   cfa->offset = 0;
   cfa->base_offset = 0;
   cfa->indirect = 0;
-  cfa->reg = -1;
+  cfa->reg.set_by_dwreg (-1);
+
+  /* Record previous register pieces here.  */
+  struct cfa_reg span;
+  span.set_by_dwreg (INVALID_REGNUM);
 
   for (ptr = loc; ptr != NULL; ptr = ptr->dw_loc_next)
 {
@@ -578,10 +582,10 @@ get_cfa_from_loc_descr (dw_cfa_location *cfa, struct dw_loc_descr_node *loc)
 	case DW_OP_reg29:
 	case DW_OP_reg30:
 	case DW_OP_reg31:
-	  cfa->reg = op - DW_OP_reg0;
+	  cfa->reg.set_by_dwreg (op - DW_OP_reg0);
 	  break;
 	case DW_OP_regx:
-	  cfa->reg = ptr->dw_loc_oprnd1.

Re: [PATCH, rs6000] Fix Vector long long subtype (PR96139)

2020-09-03 Thread will schmidt via Gcc-patches
On Wed, 2020-09-02 at 05:13 -0500, Segher Boessenkool wrote:
> Hi Will,
> 
> On Tue, Sep 01, 2020 at 09:00:20PM -0500, will schmidt wrote:
> >   This corrects an issue with the powerpc vector long long
> > subtypes.
> > As reported by SjMunroe in PR96139.  When building some code with
> > -Wall
> > and attempting to print an element of a "long long vector" with a
> > long long
> > printf format string, we will report an error because the vector
> > sub-type
> > was improperly defined as int.
> > 
> > When defining a V2DI_type_node we use a TARGET_POWERPC64 ternary to
> > define the V2DI_type_node with "vector long" or "vector long long".
> > We also need to specify the proper sub-type when we define the
> > type.
> > -  V2DI_type_node = rs6000_vector_type (TARGET_POWERPC64 ?
> > "__vector long"
> > -  : "__vector long long",
> > -  intDI_type_node, 2);
> > +  V2DI_type_node
> > += rs6000_vector_type (TARGET_POWERPC64
> > + ? "__vector long" : "__vector long long",
> > + TARGET_POWERPC64
> > + ? long_long_integer_type_node :
> > intDI_type_node,
> > + 2);
> 
> Can't you just use long_long_integer_type_node in all cases?  Or,
> what
> else is intDI_type_node for 32 bit?

I'm not sure offhand.  I'm tending to assume the use of intDI_type_node
is critical for some underlying reason.  I'll give this a spin with
just long_long_integer_type_node and see what happens.  
thanks,


> 
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/powerpc/pr96139-a.c
> > @@ -0,0 +1,32 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -Wall -m32 " } */
> 
> (trailing space, here and elsewhere -- not that it matters of course)

yup, will fix.  thanks for the review. 

-Will

> 
> 
> Segher



[committed] libstdc++: Add workaround for weird std::tuple error [PR 96592]

2020-09-03 Thread Jonathan Wakely via Gcc-patches
This "fix" makes no sense, but it avoids an error from G++ about
std::is_constructible being incomplete. The real problem is elsewhere,
but this "fixes" the regression for now.

libstdc++-v3/ChangeLog:

PR libstdc++/96592
* include/std/tuple (_TupleConstraints): Use
alternative is_constructible instead of std::is_constructible.
* testsuite/20_util/tuple/cons/96592.cc: New test.

Tested powerpc64le-linux. Committed to trunk.

This needs to go on the gcc-10 branch too.


commit 032a4b42cc5f2105f622690ce2552f1c30e1d227
Author: Jonathan Wakely 
Date:   Thu Sep 3 16:26:16 2020

libstdc++: Add workaround for weird std::tuple error [PR 96592]

This "fix" makes no sense, but it avoids an error from G++ about
std::is_constructible being incomplete. The real problem is elsewhere,
but this "fixes" the regression for now.

libstdc++-v3/ChangeLog:

PR libstdc++/96592
* include/std/tuple (_TupleConstraints): Use
alternative is_constructible instead of std::is_constructible.
* testsuite/20_util/tuple/cons/96592.cc: New test.

diff --git a/libstdc++-v3/include/std/tuple b/libstdc++-v3/include/std/tuple
index 1c22d4db788..06f56337ce4 100644
--- a/libstdc++-v3/include/std/tuple
+++ b/libstdc++-v3/include/std/tuple
@@ -539,6 +539,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   template
 struct _TupleConstraints
 {
+      template<typename _Tp, typename _Up> // Workaround for PR 96592
+   using is_constructible
+ = __bool_constant<__is_constructible(_Tp, _Up)>;
+
   // Constraint for a non-explicit constructor.
   // True iff each Ti in _Types... can be constructed from Ui in _UTypes...
   // and every Ui is implicitly convertible to Ti.
diff --git a/libstdc++-v3/testsuite/20_util/tuple/cons/96592.cc 
b/libstdc++-v3/testsuite/20_util/tuple/cons/96592.cc
new file mode 100644
index 000..326ab0ef2a6
--- /dev/null
+++ b/libstdc++-v3/testsuite/20_util/tuple/cons/96592.cc
@@ -0,0 +1,58 @@
+// Copyright (C) 2020 Free Software Foundation, Inc.
+//
+// This file is part of the GNU ISO C++ Library.  This library is free
+// software; you can redistribute it and/or modify it under the
+// terms of the GNU General Public License as published by the
+// Free Software Foundation; either version 3, or (at your option)
+// any later version.
+
+// This library is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+// GNU General Public License for more details.
+
+// You should have received a copy of the GNU General Public License along
+// with this library; see the file COPYING3.  If not see
+// .
+
+// { dg-do compile { target c++11 } }
+
+#include 
+
+// PR 96592 comment 0
+
+template 
+struct SomeQuery {
+SessionT& session_;
+SomeQuery(SessionT& session) : session_(session) {}
+};
+
+template 
+struct Handler {
+std::tuple> queries_;
+Handler(SessionT& session) : queries_(session) {}
+};
+
+struct Session {
+Handler handler_;
+Session() : handler_{*this} {}
+};
+
+Session session;
+
+// PR 96592 comment 1
+template 
+class DependsOnT
+{
+public:
+DependsOnT(T&) {}
+};
+
+class Test
+{
+public:
+Test() : test_{*this} {}
+
+private:
+std::tuple> test_;
+};


Re: [RFC] enable flags-unchanging asms, add_overflow/expand/combine woes

2020-09-03 Thread Segher Boessenkool
On Thu, Sep 03, 2020 at 07:03:53AM -0300, Alexandre Oliva wrote:
> On Sep  2, 2020, Segher Boessenkool  wrote:
> > (And aarch, but not the other targets with default clobbers).
> 
> And arm.  I didn't look very hard for others yet; I wasn't sure it would
> be worth pursuing any further.

Oh, I meant what was in your patch.  From a grep, all of mn10300, nds32,
arm, cris, pdp11, rs6000, i386, visium, aarch64 have default clobbers.
Hrm, what you said was the targets with inline asm flag output
constraints?

> > People write empty asm statements not because they would like no insns
> > emitted from it, but *because* they want the other effects an asm has
> > (for example, an empty asm usually has no outputs, so it is volatile,
> > and that makes sure it is executed in the real machine exactly as often
> > as in the abstract machine).  So your expectation might be wrong,
> > someone might want an empty asm to clobber cc on x86 (like any asm is
> > documented as doing).
> 
> I just can't imagine a case in which flags set by an earlier
> instruction, and that could be usable by a subsequent instruction, could
> possibly be clobbered by an *empty* sequence of insns.  Maybe I just
> lack imagination about all the sorts of weird things that people use
> such asm statements for...  Got any example of something that could be
> broken by such a change to educate me?  Not that I need it, I'm just
> curious.

Before your change, flags could not be live through an asm.  After your
change, it can.  So it does change things...  It seems like this should
never matter, but :-)
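
A minimal sketch of the kind of empty asm being discussed -- one written
for its side effects, not for any emitted instructions (illustrative only):

/* The volatile qualifier and the "memory" clobber are the point: the
   statement is executed exactly as often as in the abstract machine and
   acts as an optimization barrier, even though it emits no instructions.  */
static inline void
compiler_barrier (void)
{
  __asm__ volatile ("" : : : "memory");
}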

> > But how about a "none" clobber?
> 
> I like that!

Okay, sold!  Now to convince everyone else ;-)

> I don't know.  There's a piece of hand-crafted x86 assembly that I'm
> trying to find some way to replicate in compiler-optimized code.  I
> don't know whether two RMWs could possibly be faster, but there are two
> concerns driving me to try to use the same insn sequence:
> 
> - compactness; this is for per-block instrumentation, and if we use 4
>   larger insns instead of 2 smaller ones, there might be even more
>   significant impact on performance because of code cache issues

Yes, it *is* smaller code, certainly.  As I said before, GCC does not
generate very good code with RMWs :-/

> >> If we retried the 3-insn substitution after substituting the flag store
> >> into the add for the adc,
> 
> > combine should retry every combination if any of the input insns to it
> > have changed (put another way, if any insn is changed all combinations
> > with it are tried anew).
> 
> Alas, they don't change, it's an intervening flag-store that does.  When
> we try the 3 insns that make up RMW, while the flag store is still there
> before the W, it triggers one of the cant_combine tests.  After it's
> gone, we don't reconsider.  Does that count as a bug to be filed?

An enhancement bug, sure...  "missing optimisation".  But this is
expected behaviour at least :-)

> >> we might succeed, but only if we had a pattern
> >> that matched add3_cc_overflow_1's parallel with the flag-setter as
> >> the second element of the parallel, because that's where combine adds it
> >> to the new i3 pattern, after splitting it out of i2.
> 
> > That sounds like the backend pattern has it wrong then?  There is a
> > canonical order for this?
> 
> Much as I can tell, there isn't, it's just an arbitrary choice of
> backends, some do it one way or the other, and that causes combine to be
> able to perform some combinations but not others.

Hrm, I could swear it was in the documentation somewhere, but I cannot
find it currently.  I can find comments in combine.c that refer to this
though, like
  /* Many machines that don't use CC0 have insns that can both perform an
 arithmetic operation and set the condition code.  These operations will
 be represented as a PARALLEL with the first element of the vector
 being a COMPARE of an arithmetic operation with the constant zero.
 The second element of the vector will set some pseudo to the result
 of the same arithmetic operation.

Ah found it (md.texi, @cindex @code{compare}, canonicalization of):

@item
For instructions that inherently set a condition code register, the
@code{compare} operator is always written as the first RTL expression of
the @code{parallel} instruction pattern.  For example,

@smallexample
(define_insn ""
  [(set (reg:CCZ FLAGS_REG)
(compare:CCZ
  (plus:SI
(match_operand:SI 1 "register_operand" "%r")
(match_operand:SI 2 "register_operand" "r"))
  (const_int 0)))
   (set (match_operand:SI 0 "register_operand" "=r")
(plus:SI (match_dup 1) (match_dup 2)))]
  ""
  "addl %0, %1, %2")
@end smallexample

Not sure this 100% applies to your case though?

> >> I suppose adding such patterns manually isn't the way to go.  I wonder
> >> if getting recog_for_combine to recognize and reorder PARALLELs
> >> appearing out of order would get too expensive, eve

Re: [patch] PR tree-optimization/96818 - cast label range to type of switch operand

2020-09-03 Thread Andrew MacLeod via Gcc-patches

On 9/3/20 3:43 AM, Richard Biener wrote:

On Thu, Sep 3, 2020 at 9:21 AM Aldy Hernandez  wrote:



On 8/31/20 2:55 PM, Richard Biener wrote:

On Mon, Aug 31, 2020 at 11:10 AM Aldy Hernandez  wrote:



On 8/31/20 10:33 AM, Richard Biener wrote:

On Mon, Aug 31, 2020 at 10:20 AM Aldy Hernandez  wrote:

As discussed in the PR, the type of a switch operand may be different
than the type of the individual cases.  This is causing us to attempt to
intersect incompatible ranges.

This inconsistent IL seems wrong to me, but it also seems pervasive
throughout GCC.  Jason, as one of the original gimple designers, do you
remember if this was an oversight, or was there a reason for this?

Richard, you mention in the PR that normalizing the IL would cause
propagation of widening cast sources into switches harder.  Couldn't we
just cast all labels to the switch operand type upon creation?  What
would be the problem with that?  Would we generate sub-optimal code?

In either case, I'd hate to leave trunk in a broken case while this is
being discussed.  The attached patch fixes the PRs.

OK?

  widest_irange label_range (CASE_LOW (min_label), case_high);
+  if (!types_compatible_p (label_range.type (), range_of_op->type ()))
+   range_cast (label_range, range_of_op->type ());
  label_range.intersect (range_of_op);

so label_range is of 'widest_int' type but then instead of casting
range_of_op to widest_irange you cast that widest_irange to the
type of the range op?  Why not build label_range in the type
of range_op immediately?

The "widest" in widest_irange does not denote the type of the range, but
the number of sub-ranges that can maximally fit.  It has nothing to do
with the underlying type of its elements.  Instead, it is supposed to
indicate that label_range can fit an infinite amount of sub-ranges.  In
practice this is 255 sub-ranges, but we can adjust that if needed.

That's definitely confusing naming then :/

I've been mulling this over, and you're right.  The naming does imply
some relation to widest_int.

Originally I batted some ideas on naming this, but came up short.  I'm
still not sure what it should be.

max_irange?

biggest_irange?

Perhaps some name involving "int_range", since int_range<> is how ranges
are declared (int_range_max??).

The issue is that widest_irange has a connection to the type of the range
while it seems to constrain the number of subranges.  So if there's
a irange<1>, irange<2>, etc. then irange_max would be a typedef that
makes most sense (both <2> and _max are then suffixes).

Hmm, a simple grep reveals

value-range.h:typedef int_range<255> widest_irange;
value-range.h:class GTY((user)) int_range : public irange

so widest_irange is not an irange but an int_range?  But then
there's little use of 'irange' or 'int_range', most code uses
either irange & arguments (possibly accepting sth else than
int_range) and variables are widest_irange or irange3
(typedefed from int_range<3>).

typedef int_range<3> irange3;

isn't a real thing... it's a typedef Aldy used in the range-ops self tests
as shorthand.  I think it's a leftover from a previous incarnation of
irange.




IMHO a typedef irange* to int_range is confusing a bit.

Why isn't it int_range3 or widest_int_rage?  Why is there
irange and int_range?


irange is the generic class that is used to operate on integral ranges.
It implements all the range operations and provides the API.  It is
agnostic to the number of sub-ranges.


int_range is the template that instantiates an irange object and
associates enough memory to represent the number of sub-ranges you want
to work with, so it is a bit more than a typedef: it's a derived class
which adds memory.


declaring a function
foo (int_range<3> &result, op1, op2)

won't allow you to pass an int_range<4> as the argument... it will take
nothing but an int_range<3>, as they are different types in C++... which
is very limiting.


foo (irange &result, op1, op2)

on the other hand allows you to pass any implementation/instantiation
around you want.  So we use irange everywhere we want to work with a
generic integral range, and when you need to declare a variable to store
a range, you use int_range.
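
A minimal sketch of that calling convention (simplified shapes, not the
actual value-range.h declarations):

class irange
{
  // The generic API (intersect, union_, type (), ...) lives here; the
  // class is agnostic about how many sub-ranges the object can hold.
};

template<unsigned N>
class int_range : public irange
{
  // Storage for up to N sub-ranges is added by the derived class.
};

void
foo (irange &result)           // accepts any instantiation
{
  (void) result;
}

void
example ()
{
  int_range<3> a;
  int_range<255> b;
  foo (a);   // OK
  foo (b);   // also OK: both are iranges
}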





Well, still stand with irange_max, or, preferably int_range_max.
Or ...

template
class GTY((user)) int_range : public irange
{

so people can use

int_range x;

and get the max range by default?

Indeed, I had forgotten about default template arguments, the only 
problem is


int_range x;

is valid in c++17, but previous to that we have to do

int_range<> x;

It's a bit ugly, and possibly a bit less natural, but it does show the
intent.


So I think the 2 basic choices are:

int_range<> r;
int_range_max r;

I'm thinking 'int_range_max'?    but I really don't feel strongly one 
way or the other...   Eventually we'll probably end up with c++17 and we 
can revert to  'int_range' everywhere.
If we go with the _max suffix, then we should stick to the "int_range" 
pattern 

Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-03 Thread Qing Zhao via Gcc-patches
Looks like the PDF attachments do not work with this alias either.
H.J. Lu helped me to upload the performance data and code size data to the
following wiki page:

https://gitlab.com/x86-gcc/gcc/-/wikis/Zero-call-used-registers-data

Please refer to this link for the data.

thanks.

Qing

> On Sep 3, 2020, at 10:08 AM, Qing Zhao via Gcc-patches 
>  wrote:
> 
> 
> Hi,
> 
> Looks like both attached .csv files were deleted during the email delivery 
> procedure. Not sure what’s the reason for this.
> 
> Then I have to copy the text file here for you reference:
> 
> benchmarks:
> C   500.perlbench_r  
> C   502.gcc_r 
> C   505.mcf_r   
> C++ 520.omnetpp_r
> C++ 523.xalancbmk_r  
> C   525.x264_r
> C++ 531.deepsjeng_r
> C++ 541.leela_r
> C   557.xz_r   
> 
> 
> C++/C/Fortran   507.cactuBSSN_r  
> C++ 508.namd_r
> C++ 510.parest_r 
> C++/C   511.povray_r   
> C   519.lbm_r 
> Fortran/C   521.wrf_r 
> C++/C   526.blender_r   
> Fortran/C   527.cam4_r  
> C   538.imagick_r  
> C   544.nab_r
> 
> ***runtime overhead data and code size overhead data, I converted then to PDF 
> files, hopefully this time I can attach it with the email:
> 
> thanks.
> 
> Qing
> 
> 
> 
> 
> 
> 
>> On Sep 3, 2020, at 9:29 AM, Qing Zhao via Gcc-patches 
>>  wrote:
>> 
>> Hi,
>> 
>> Per request, I collected runtime performance data and code size data with 
>> CPU2017 on a X86 platform. 
>> 
>> *** Machine info:
>> model name : Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
>> $ lscpu | grep NUMA
>> NUMA node(s):  2
>> NUMA node0 CPU(s): 0-21,44-65
>> NUMA node1 CPU(s): 22-43,66-87
>> 
>> ***CPU2017 benchmarks: 
>> all the benchmarks with C/C++, 9 Integer benchmarks, 10 FP benchmarks. 
>> 
>> ***Configures:
>> Intrate and fprate, 22 copies. 
>> 
>> ***Compiler options:
>> no:            -g -O2 -march=native
>> used_gpr_arg:  no + -fzero-call-used-regs=used-gpr-arg
>> used_arg:      no + -fzero-call-used-regs=used-arg
>> all_arg:       no + -fzero-call-used-regs=all-arg
>> used_gpr:      no + -fzero-call-used-regs=used-gpr
>> all_gpr:       no + -fzero-call-used-regs=all-gpr
>> used:          no + -fzero-call-used-regs=used
>> all:           no + -fzero-call-used-regs=all
>> 
>> ***each benchmark runs 3 times. 
>> 
>> ***runtime performance data:
>> Please see the attached csv file
>> 
>> 
>> From the data, we can see that:
>> On average, all the options starting with “used_…”  (i.e, only the registers 
>> that are used in the routine will be zeroed) have very low runtime 
>> overheads, at most 1.72% for integer benchmarks, and 1.17% for FP 
>> benchmarks. 
>> If all the registers will be zeroed, the runtime overhead is bigger, all_arg 
>> is 5.7%, all_gpr is 3.5%, and all is 17.56% for integer benchmarks on 
>> average. 
>> Looks like the overhead of zeroing vector registers is much bigger. 
>> 
>> For ROP mitigation, -fzero-call-used-regs=used-gpr-arg should be enough, the 
>> runtime overhead with this is very small.
>> 
>> ***code size increase data:
>> 
>> Please see the attached file 
>> 
>> 
>> From the data, we can see that:
>> The code size impact in general is very small, the biggest is “all_arg”, 
>> which is 1.06% for integer benchmark, and 1.13% for FP benchmarks.
>> 
>> So, from the data collected, I think that the run-time overhead and code 
>> size increase from this option are very reasonable. 
>> 
>> Let me know you comments and opinions.
>> 
>> thanks.
>> 
>> Qing
>> 
>>> On Aug 25, 2020, at 4:54 PM, Qing Zhao via Gcc-patches 
>>>  wrote:
>>> 
>>> 
>>> 
 On Aug 24, 2020, at 3:20 PM, Segher Boessenkool 
  wrote:
 
 Hi!
 
 On Mon, Aug 24, 2020 at 01:02:03PM -0500, Qing Zhao wrote:
>> On Aug 24, 2020, at 12:49 PM, Segher Boessenkool 
>>  wrote:
>> On Wed, Aug 19, 2020 at 06:27:45PM -0500, Qing Zhao wrote:
 On Aug 19, 2020, at 5:57 PM, Segher Boessenkool 
  wrote:
 Numbers on how expensive this is (for what arch, in code size and in
 execution time) would be useful.  If it is so expensive that no one 
 will
 use it, it helps security at most none at all :-(
>> 
>> Without numbers on this, no one can determine if it is a good tradeoff
>> for them.  And we (the GCC people) cannot know if it will be useful for
>> enough users that it will be worth the effort for us.  Which is why I
>> keep hammering on this point.
> I can collect some run-time overhead data on this, do you have a 
> recommendation on what test suite I can use
> For this testing? (Is CPU2017 good enough)?
 
 I would use something more real-life, not 12 small pieces of code.
>>> 
>>> There is some basic information about the benchmarks of CPU2017 in below 
>>> link:
>>> 
>>> https://urldefense.com/v3/__

[PATCH] bpf: generate indirect calls for xBPF

2020-09-03 Thread David Faust via Gcc-patches
This patch updates the BPF back end to generate indirect calls via
the 'call %reg' instruction when targeting xBPF.

Additionally, the BPF ASM_SPEC is updated to pass along -mxbpf to
gas, where it is now supported.

2020-09-03  David Faust  

gcc/

* config/bpf/bpf.h (ASM_SPEC): Pass -mxbpf to gas, if specified.
* config/bpf/bpf.c (bpf_output_call): Support indirect calls in xBPF.

gcc/testsuite/

* gcc.target/bpf/xbpf-indirect-call-1.c: New test.
---
 gcc/config/bpf/bpf.c  |  9 ++--
 gcc/config/bpf/bpf.h  |  2 +-
 .../gcc.target/bpf/xbpf-indirect-call-1.c | 21 +++
 3 files changed, 29 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/bpf/xbpf-indirect-call-1.c

diff --git a/gcc/config/bpf/bpf.c b/gcc/config/bpf/bpf.c
index 972a91adcd8..13181f21c5b 100644
--- a/gcc/config/bpf/bpf.c
+++ b/gcc/config/bpf/bpf.c
@@ -705,8 +705,13 @@ bpf_output_call (rtx target)
break;
   }
 default:
-  error ("indirect call in function, which are not supported by eBPF");
-  output_asm_insn ("call 0", NULL);
+  if (TARGET_XBPF)
+   output_asm_insn ("call\t%0", &target);
+  else
+   {
+ error ("indirect call in function, which are not supported by eBPF");
+ output_asm_insn ("call 0", NULL);
+   }
   break;
 }
 
diff --git a/gcc/config/bpf/bpf.h b/gcc/config/bpf/bpf.h
index 940029ba606..359f389a134 100644
--- a/gcc/config/bpf/bpf.h
+++ b/gcc/config/bpf/bpf.h
@@ -22,7 +22,7 @@
 
 / Controlling the Compilation Driver.  */
 
-#define ASM_SPEC "%{mbig-endian:-EB} %{!mbig-endian:-EL}"
+#define ASM_SPEC "%{mbig-endian:-EB} %{!mbig-endian:-EL} %{mxbpf:-mxbpf}"
 #define LINK_SPEC "%{mbig-endian:-EB} %{!mbig-endian:-EL}"
 #define LIB_SPEC ""
 #define STARTFILE_SPEC ""
diff --git a/gcc/testsuite/gcc.target/bpf/xbpf-indirect-call-1.c 
b/gcc/testsuite/gcc.target/bpf/xbpf-indirect-call-1.c
new file mode 100644
index 000..dc4b3cfb12d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/bpf/xbpf-indirect-call-1.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-options "-mxbpf" } */
+
+/* GCC should generate an indirect call instruction (call %REG)
+   when targetting xBPF.  */
+
+void
+foo ()
+{
+  ;
+}
+
+void
+bar()
+{
+  void (*funp) () = &foo;
+
+  (*funp) ();
+}
+
+/* { dg-final { scan-assembler "call\t%r" } } */
-- 
2.26.2



Re: [patch] PR tree-optimization/96818 - cast label range to type of switch operand

2020-09-03 Thread Richard Biener via Gcc-patches
On September 3, 2020 5:39:18 PM GMT+02:00, Andrew MacLeod  
wrote:
>On 9/3/20 3:43 AM, Richard Biener wrote:
>> On Thu, Sep 3, 2020 at 9:21 AM Aldy Hernandez 
>wrote:
>>>
>>>
>>> On 8/31/20 2:55 PM, Richard Biener wrote:
 On Mon, Aug 31, 2020 at 11:10 AM Aldy Hernandez 
>wrote:
>
>
> On 8/31/20 10:33 AM, Richard Biener wrote:
>> On Mon, Aug 31, 2020 at 10:20 AM Aldy Hernandez
> wrote:
>>> As discussed in the PR, the type of a switch operand may be
>different
>>> than the type of the individual cases.  This is causing us to
>attempt to
>>> intersect incompatible ranges.
>>>
>>> This inconsistent IL seems wrong to me, but it also seems
>pervasive
>>> throughout GCC.  Jason, as one of the original gimple designers,
>do you
>>> remember if this was an oversight, or was there a reason for
>this?
>>>
>>> Richard, you mention in the PR that normalizing the IL would
>cause
>>> propagation of widening cast sources into switches harder. 
>Couldn't we
>>> just cast all labels to the switch operand type upon creation? 
>What
>>> would be the problem with that?  Would we generate sub-optimal
>code?
>>>
>>> In either case, I'd hate to leave trunk in a broken case while
>this is
>>> being discussed.  The attached patch fixes the PRs.
>>>
>>> OK?
>>   widest_irange label_range (CASE_LOW (min_label),
>case_high);
>> +  if (!types_compatible_p (label_range.type (),
>range_of_op->type ()))
>> +   range_cast (label_range, range_of_op->type ());
>>   label_range.intersect (range_of_op);
>>
>> so label_range is of 'widest_int' type but then instead of
>casting
>> range_of_op to widest_irange you cast that widest_irange to the
>> type of the range op?  Why not build label_range in the type
>> of range_op immediately?
> The "widest" in widest_irange does not denote the type of the
>range, but
> the number of sub-ranges that can maximally fit.  It has nothing
>to do
> with the underlying type of its elements.  Instead, it is supposed
>to
> indicate that label_range can fit an infinite amount of
>sub-ranges.  In
> practice this is 255 sub-ranges, but we can adjust that if needed.
 That's definitely confusing naming then :/
>>> I've been mulling this over, and you're right.  The naming does
>imply
>>> some relation to widest_int.
>>>
>>> Originally I batted some ideas on naming this, but came up short. 
>I'm
>>> still not sure what it should be.
>>>
>>> max_irange?
>>>
>>> biggest_irange?
>>>
>>> Perhaps some name involving "int_range", since int_range<> is how
>ranges
>>> are declared (int_range_max??).
>> The issue is that widest_irange has a connection to the type of the
>range
>> while it seems to constrain the number of subranges.  So if there's
>> a irange<1>, irange<2>, etc. then irange_max would be a typedef that
>> makes most sense (both <2> and _max are then suffixes).
>>
>> Hmm, a simple grep reveals
>>
>> value-range.h:typedef int_range<255> widest_irange;
>> value-range.h:class GTY((user)) int_range : public irange
>>
>> so widest_irange is not an irange but an int_range?  But then
>> there's little use of 'irange' or 'int_range', most code uses
>> either irange & arguments (possibly accepting sth else than
>> int_range) and variables are widest_irange or irange3
>> (typedefed from int_range<3>).
>typedef int_range<3> irange3;
>
>isnt a real thing.. its a typedef Aldy used in the range-ops self tests
>
>as shorthand.. I think its a leftover from a previous incarnation of 
>irange.
>
>>
>> IMHO a typedef irange* to int_range is confusing a bit.
>>
>> Why isn't it int_range3 or widest_int_rage?  Why is there
>> irange and int_range?
>
>irange is the generic class that is used to operate on integral
>ranges.  
>It implements all the range operations and provides the API. it is 
>agnostic to the number of sub ranges.
>
>int_range is the template that instantiates an irange object, and 
>associates enough memory to represent the number of sub-ranges you want
>
>to work with.. so it is a bit more than a typedef, its a derived class 
>which adds memory.
>
>declaring a function
>foo (int_range<3> &result, op1, op2)
>
>wont allow you to pass an int_range<4> as the argument... it will take 
>nothing but an int_range<3> as they are different types in c++... which
>
>is very limiting
>
>foo (irange &result, op1, op2)
>
>on the other hand allows you to pass any implementation/instantiation 
>around you want. So we use irange everywhere we want to work with an 
>generic integral range, and when you need to declare a variable to
>store 
>a range, you use int_range
>
>
>>
>> Well, still stand with irange_max, or, preferably int_range_max.
>> Or ...
>>
>> template
>> class GTY((user)) int_range : public irange
>> {
>>
>> so people can use
>>
>> int_range x;
>>
>> and get the max range by default?
>>
>Indeed, I had forgotten about default template arguments, the only 

Re: [RFC] enable flags-unchanging asms, add_overflow/expand/combine woes

2020-09-03 Thread Alexandre Oliva
[dropping port maintainers for combine-mostly subthread]

On Sep  3, 2020, Segher Boessenkool  wrote:

> On Thu, Sep 03, 2020 at 07:03:53AM -0300, Alexandre Oliva wrote:
>> On Sep  2, 2020, Segher Boessenkool  wrote:
>> >> we might succeed, but only if we had a pattern
>> >> that matched add3_cc_overflow_1's parallel with the flag-setter as
>> >> the second element of the parallel, because that's where combine adds it
>> >> to the new i3 pattern, after splitting it out of i2.
>> 
>> > That sounds like the backend pattern has it wrong then?  There is a
>> > canonical order for this?
>> 
>> Much as I can tell, there isn't, it's just an arbitrary choice of
>> backends, some do it one way or the other, and that causes combine to be
>> able to perform some combinations but not others.

> For instructions that inherently set a condition code register, the
> @code{compare} operator is always written as the first RTL expression of
> the @code{parallel} instruction pattern.

Interesting.  I'm pretty sure I read email recently that suggested it
was really up to the port, but I've caught up with GCC emails from years
ago, so that might have been it.  Or I just misremember.  Whatever.

The x86 pattern that fails to match in combine has the flags setter
first, but combine places it second, after splitting it out of i2 and
then appending it back to i3.

Alas, it would be just as legitimate for combine to go the opposite way,
substituting the flags set into another insn, and then tacking the other
set onto the substituted-into insn.

Since there is a canonical order, maybe combine should attempt to follow
that order.  Meanwhile, there's the patchlet below, that gets us (*) a
grand total of *one* additional insn combine (in libbacktrace/dwarf.c)
per bootstrap stage, and then, in target libs, 3 permuted combines in
libhsail-rt/rt/workitems.c and 4 in s-expmod.adb.  Not bad, but I'd
agree it could be regarded as a bit of a waste.

This is not finished, I'd like to abstract out a little more of the
permutation representation, that had to be explicitly used in combine to
extract the actual clobber count before permuting the newly-created
rtvec, but I was also surprised at how easy it turned out to be.  I was
also happy I missed a few orders of magnitude in my earlier estimates
about how far we could go into (representing) permutations without
taking up all of num_clobbers.  It's so loose that I went for bitfields
rather than arithmetic coding.  But the reason I wanted to abstract it
out was to make it easy to switch to a different encoding.

Anyway...  Does this still seem worth pursuing?


permute parallel items in recog

From: Alexandre Oliva 


---
 gcc/combine.c  |   35 
 gcc/genrecog.c |   78 --
 gcc/rtl.h  |   81 
 3 files changed, 156 insertions(+), 38 deletions(-)

diff --git a/gcc/combine.c b/gcc/combine.c
index c88382e..4383ce5 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -11510,11 +11510,14 @@ recog_for_combine_1 (rtx *pnewpat, rtx_insn *insn, 
rtx *pnotes)
  them.  Then check to make sure that all of them are dead.  */
   if (num_clobbers_to_add)
 {
+  int nclob = (PERMUTE_MULT (0)
+  ? (num_clobbers_to_add % PERMUTE_MULT (0))
+  : num_clobbers_to_add);
   rtx newpat = gen_rtx_PARALLEL (VOIDmode,
 rtvec_alloc (GET_CODE (pat) == PARALLEL
  ? (XVECLEN (pat, 0)
-+ num_clobbers_to_add)
- : num_clobbers_to_add + 1));
++ nclob)
+ : nclob + 1));
 
   if (GET_CODE (pat) == PARALLEL)
for (i = 0; i < XVECLEN (pat, 0); i++)
@@ -11522,19 +11525,27 @@ recog_for_combine_1 (rtx *pnewpat, rtx_insn *insn, 
rtx *pnotes)
   else
XVECEXP (newpat, 0, 0) = pat;
 
-  add_clobbers (newpat, insn_code_number);
+  if (num_clobbers_to_add != nclob)
+   recog_unpermute_parallel (newpat, &num_clobbers_to_add);
+  gcc_checking_assert (num_clobbers_to_add == nclob);
 
-  for (i = XVECLEN (newpat, 0) - num_clobbers_to_add;
-  i < XVECLEN (newpat, 0); i++)
+  if (num_clobbers_to_add)
{
- if (REG_P (XEXP (XVECEXP (newpat, 0, i), 0))
- && ! reg_dead_at_p (XEXP (XVECEXP (newpat, 0, i), 0), insn))
-   return -1;
- if (GET_CODE (XEXP (XVECEXP (newpat, 0, i), 0)) != SCRATCH)
+ add_clobbers (newpat, insn_code_number);
+
+ for (i = XVECLEN (newpat, 0) - num_clobbers_to_add;
+  i < XVECLEN (newpat, 0); i++)
{
- gcc_assert (REG_P (XEXP (XVECEXP (newpat, 0, i), 0)));
- notes = alloc_reg_note (REG_UNUSED,
-

Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-03 Thread Kees Cook via Gcc-patches
On Thu, Sep 03, 2020 at 09:29:54AM -0500, Qing Zhao wrote:
> On average, all the options starting with “used_…”  (i.e, only the registers 
> that are used in the routine will be zeroed) have very low runtime overheads, 
> at most 1.72% for integer benchmarks, and 1.17% for FP benchmarks. 
> If all the registers will be zeroed, the runtime overhead is bigger, all_arg 
> is 5.7%, all_gpr is 3.5%, and all is 17.56% for integer benchmarks on 
> average. 
> Looks like the overhead of zeroing vector registers is much bigger. 
> 
> For ROP mitigation, -fzero-call-used-regs=used-gpr-arg should be enough, the 
> runtime overhead with this is very small.

That looks great; thanks for doing those tests!

(And it seems like these benchmarks are kind of a "worst case" scenario
with regard to performance, yes? As in it's mostly tight call loops?)

-- 
Kees Cook


Re: [RFC] enable flags-unchanging asms, add_overflow/expand/combine woes

2020-09-03 Thread Alexandre Oliva
On Sep  3, 2020, Segher Boessenkool  wrote:

> On Thu, Sep 03, 2020 at 07:03:53AM -0300, Alexandre Oliva wrote:
>> On Sep  2, 2020, Segher Boessenkool  wrote:
>> > (And aarch, but not the other targets with default clobbers).
>> 
>> And arm.  I didn't look very hard for others yet; I wasn't sure it would
>> be worth pursuing any further.

> Oh, I meant what was in your patch.  From a grep, all of mn10300, nds32,
> arm, cris, pdp11, rs6000, i386, visium, aarch64 have default clobbers.
> Hrm, what you said was the targets with inline asm flag output
> constraints?

I found implicit cc clobbers on x86 only, and then I stumbled across the
same notation used for asm cc-setting in arm/aarch64, but that was all,
I didn't look any further (yet).  I'll look at the other targets with
asm_adjust hooks and see whether they need adjustments for "none", if we
agree that's the way to go.


>> > People write empty asm statements not because they would like no insns
>> > emitted from it, but *because* they want the other effects an asm has
>> > (for example, an empty asm usually has no outputs, so it is volatile,
>> > and that makes sure it is executed in the real machine exactly as often
>> > as in the abstract machine).  So your expectation might be wrong,
>> > someone might want an empty asm to clobber cc on x86 (like any asm is
>> > documented as doing).
>> 
>> I just can't imagine a case in which flags set by an earlier
>> instruction, and that could be usable by a subsequent instruction, could
>> possibly be clobbered by an *empty* sequence of insns.  Maybe I just
>> lack imagination about all the sorts of weird things that people use
>> such asm statements for...  Got any example of something that could be
>> broken by such a change to educate me?  Not that I need it, I'm just
>> curious.

> Before your change, flags could not be live through an asm.

While implementing this patchlet, I learned that standard asms (as
opposed to GCC-extended ones) only undergo machine-dependent adjustments
if their asm pattern string is non-empty.  So there is precedent.  I may
even have a finger on that precedent, from very long ago.

> After your change, it can.  So it does change things...

No doubt it does change things, that's not the point.  The point is how
could it break things.  If the compiler can see flags are set before the
asm, and it can see the flags could be used after the asm, and the asm
has no code whatsoever in it that could possibly clobber the flags...
What could possibly go wrong?  (Famous last words? :-)



>> > But how about a "none" clobber?
>> 
>> I like that!

> Okay, sold!  Now to convince everyone else ;-)

Here's a prototype implementation, x86 only.  (ARM and Thumb do not add
any clobbers by default, so it's already good for "none")  I haven't yet
looked at other targets.  It's missing documentation and tests, too.
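
For reference, the intended usage would look roughly like this (proposed
syntax only, per the prototype below; not accepted or documented yet):

/* A leading "none" clobber tells GCC this asm clobbers nothing at all,
   so the default flags clobber is suppressed and the flags may stay live
   across the statement.  */
static inline void
probe_point (void)
{
  __asm__ volatile ("# no flag-modifying insns here" : : : "none");
}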


enable flags-unchanging asms

---
 gcc/cfgexpand.c|6 ++
 gcc/config/i386/i386.c |2 +-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
index b334ea0..f7ae789 100644
--- a/gcc/cfgexpand.c
+++ b/gcc/cfgexpand.c
@@ -2986,6 +2986,12 @@ expand_asm_stmt (gasm *stmt)
   const char *regname = TREE_STRING_POINTER (TREE_VALUE (t));
  int nregs, j;
 
+ if (i == 0 && strcmp (regname, "none") == 0)
+   {
+ clobber_rvec.safe_push (const0_rtx);
+ continue;
+   }
+
  j = decode_reg_name_and_count (regname, &nregs);
  if (j < 0)
{
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index a15807d..2afff12 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -21142,7 +21142,7 @@ ix86_md_asm_adjust (vec &outputs, vec 
&/*inputs*/,
   rtx_insn *seq = get_insns ();
   end_sequence ();
 
-  if (saw_asm_flag)
+  if (saw_asm_flag || (clobbers.length () > 0 && clobbers[0] == const0_rtx))
 return seq;
   else
 {


-- 
Alexandre Oliva, happy hacker
https://FSFLA.org/blogs/lxo/
Free Software Activist
GNU Toolchain Engineer


Re: [PATCH 1/3] Power10: Add PCREL_OPT load support

2020-09-03 Thread Michael Meissner via Gcc-patches
On Thu, Aug 20, 2020 at 09:09:40PM -0500, Segher Boessenkool wrote:
> Hi!
> 
> On Tue, Aug 18, 2020 at 02:34:01AM -0400, Michael Meissner wrote:
> > +// Maximum number of insns to scan between the load address and the load 
> > that
> 
> Please don't mix /* and // comments.  Just stick to /* comments, like
> all the rest of our backend?
> 
> > +const int MAX_PCREL_OPT_INSNS  = 10;
> 
> "static const", and not tab please (just a space).

With the switch to using flow control, this will go away.

> > +  // LWA is a DS format instruction, but LWZ is a D format instruction.  
> > We use
> > +  // DImode for the mode to force checking whether the bottom 2 bits are 0.
> 
> How so?  Can't you just check if the resulting insn is accepted? That
> would be so much more robust (and can have a correct description more
> easily, too!)

Unfortunately no.  The prefixed form will be accepted, and the whole PCREL_OPT
optimization requires a non-prefixed instruction.

> > +  // However FPR and vector registers uses the LFIWAX instruction which is
> > +  // indexed only.
> 
> (vectors use lxsiwax)

Yes.

> > +  if (GET_CODE (mem) == SIGN_EXTEND && GET_MODE (XEXP (mem, 0)) == SImode)
> 
> You're checking here whether the address of the MEM is SImode.

Whoops.

> > +{
> > +  if (!INT_REGNO_P (reg_regno))
> 
> That should use base_reg_operand instead?
> 
> > +  // The optimization will only work on non-prefixed offsettable loads.
> > +  rtx addr = XEXP (mem_inner, 0);
> > +  enum insn_form iform = address_to_insn_form (addr, mem_mode, 
> > non_prefixed);
> > +  if (iform != INSN_FORM_BASE_REG
> > +  && iform != INSN_FORM_D
> > +  && iform != INSN_FORM_DS
> > +  && iform != INSN_FORM_DQ)
> > +return false;
> 
> Sounds like something for a utility function?  Do we use this elsewhere?

I will make a utility function.  But no, we currently do not use this anywhere
else.  It might be useful for fusion.

> > +  ++pcrel_opt_next_num;
> 
> Normal style is postfix increment.  Empty lines around this are nice, it
> indicates this is the point where we decided to do the thing.  Or add a
> comment even, maybe?
> 
> > +  unsigned int addr_regno = reg_or_subregno (addr_reg);
> > +  rtx label_num = GEN_INT (pcrel_opt_next_num);
> > +  rtx reg_di = gen_rtx_REG (DImode, reg_regno);
> > +
> > +  PATTERN (addr_insn)
> > += ((addr_regno != reg_regno)
> > +   ? gen_pcrel_opt_ld_addr (addr_reg, addr_symbol, label_num, reg_di)
> > +   : gen_pcrel_opt_ld_addr_same_reg (addr_reg, addr_symbol, 
> > label_num));
> 
> Please use if() instead of multi-line expressions.  The outer parens are
> unnecessary, too.

Ok.  However, the outer parens make it easier to format in emacs.

> > +  // Revalidate the insn, backing out of the optimization if the insn is 
> > not
> > +  // supported.
> 
> So the count will be incorrect then?

No, the count is correct, since we don't count the PCREL_OPTs until everything 
is validated.

> 
> > +  INSN_CODE (addr_insn) = recog (PATTERN (addr_insn), addr_insn, 0);
> > +  if (INSN_CODE (addr_insn) < 0)
> > +{
> > +  PATTERN (addr_insn) = addr_set;
> > +  INSN_CODE (addr_insn) = recog (PATTERN (addr_insn), addr_insn, 0);
> > +  return false;
> > +}
> 
> Can you use the normal undo mechanisms, instead?  apply_change_group
> etc.

I will think about it.

> > +static rtx_insn *
> > +next_active_insn_in_basic_block (rtx_insn *insn)
> > +{
> > +  insn = NEXT_INSN (insn);
> > +
> > +  while (insn != NULL_RTX)
> > +{
> > +  /* If the basic block ends or there is a jump of some kind, exit the
> > +loop.  */
> > +  if (CALL_P (insn)
> > + || JUMP_P (insn)
> > + || JUMP_TABLE_DATA_P (insn)
> > + || LABEL_P (insn)
> > + || BARRIER_P (insn))
> > +   return NULL;
> > +
> > +  /* If this is a real insn, return it.  */
> > +  if (!insn->deleted ()
> > + && NONJUMP_INSN_P (insn)
> > + && GET_CODE (PATTERN (insn)) != USE
> > + && GET_CODE (PATTERN (insn)) != CLOBBER)
> > +   return insn;
> > +
> > +  /* Loop for USE, CLOBBER, DEBUG_INSN, NOTEs.  */
> > +  insn = NEXT_INSN (insn);
> > +}
> > +
> > +  return NULL;
> > +}
> 
> There are things to do this.  Please don't write code manually parsing
> RTL unless you have to.

I will switch to using the DF flow in the next set of patches.

> > +// Validate that a load is actually a single instruction that can be 
> > optimized
> > +// with the PCREL_OPT optimization.
> > +
> > +static bool
> > +is_single_instruction (rtx_insn *insn, rtx reg)
> 
> Of course it is a single instruction!  A single RTL instruction...
> Clarify as "single machine instruction"?

My experience is some of the insns lie in terms of things like size.  The size
is set when it is split, but this pass may/may not run before those
instructions are split.  But I will change it to just testing if the length is
4, and hope for the best.

> > +{
> > +  if (!REG_P (reg) && !SUBREG_P (reg))
> > +return false;
> 
>

Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-03 Thread Qing Zhao via Gcc-patches



> On Sep 3, 2020, at 12:13 PM, Kees Cook  wrote:
> 
> On Thu, Sep 03, 2020 at 09:29:54AM -0500, Qing Zhao wrote:
>> On average, all the options starting with “used_…”  (i.e, only the registers 
>> that are used in the routine will be zeroed) have very low runtime 
>> overheads, at most 1.72% for integer benchmarks, and 1.17% for FP 
>> benchmarks. 
>> If all the registers will be zeroed, the runtime overhead is bigger, all_arg 
>> is 5.7%, all_gpr is 3.5%, and all is 17.56% for integer benchmarks on 
>> average. 
>> Looks like the overhead of zeroing vector registers is much bigger. 
>> 
>> For ROP mitigation, -fzero-call-used-regs=used-gpr-arg should be enough, the 
>> runtime overhead with this is very small.
> 
> That looks great; thanks for doing those tests!
> 
> (And it seems like these benchmarks are kind of a "worst case" scenario
> with regard to performance, yes? As in it's mostly tight call loops?)

The top 3 benchmarks that have the most overhead from this option are: 
531.deepsjeng_r, 541.leela_r, and 511.povray_r.
All of them are C++ benchmarks. 
I guess that the most important reason is the smaller routine size in general
(especially on the hot execution paths or in loops).
As a result, the overhead of these additional zeroing instructions in each
routine will be relatively higher.
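
For concreteness, a sketch of the per-return cost being described
(illustrative only; the exact instructions and register set depend on the
target and the option variant):

/* With -fzero-call-used-regs=used-gpr-arg a small function like this has
   its used argument GPRs cleared just before returning, e.g. on x86-64
   roughly:
       xorl  %edi, %edi
       xorl  %esi, %esi
       ret
   Small, frequently called routines pay this fixed cost more often
   relative to their useful work.  */
int
add (int a, int b)
{
  return a + b;
}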

Qing

> 
> -- 
> Kees Cook



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-03 Thread Ramana Radhakrishnan via Gcc-patches
On Thu, Sep 3, 2020 at 6:13 PM Kees Cook via Gcc-patches
 wrote:
>
> On Thu, Sep 03, 2020 at 09:29:54AM -0500, Qing Zhao wrote:
> > On average, all the options starting with “used_…”  (i.e, only the 
> > registers that are used in the routine will be zeroed) have very low 
> > runtime overheads, at most 1.72% for integer benchmarks, and 1.17% for FP 
> > benchmarks.
> > If all the registers will be zeroed, the runtime overhead is bigger, 
> > all_arg is 5.7%, all_gpr is 3.5%, and all is 17.56% for integer benchmarks 
> > on average.
> > Looks like the overhead of zeroing vector registers is much bigger.
> >
> > For ROP mitigation, -fzero-call-used-regs=used-gpr-arg should be enough, 
> > the runtime overhead with this is very small.
>
> That looks great; thanks for doing those tests!
>
> (And it seems like these benchmarks are kind of a "worst case" scenario
> with regard to performance, yes? As in it's mostly tight call loops?)


That's true of some of them but definitely not all - the GCC benchmark
springs to mind in SPEC as having quite a flat profile, so I'd take a
look there and probe a bit more in that one to see what happens. Don't
ask me what else , that's all I have in my cache this evening :)

I'd also query the "average" slowdown metric in those numbers as
something that's being measured in a different way here. IIRC the SPEC
scores for int and FP are computed with a geometric mean of the
individual ratios of each of the benchmark. Thus I don't think the
average of the slowdowns is enough to talk about slowdowns for the
benchmark suite. A quick calculation of the arithmetic mean of column
B in my head suggests that it's the arithmetic mean of all the
slowdowns ?

i.e. Slowdown (Geometric Mean (x, y, z, ...)) != Arithmetic Mean (Slowdown (x), Slowdown (y), ...)
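
A small sketch of the distinction being drawn (illustrative only, not tied
to any particular data set):

// A SPEC-style suite score is a geometric mean of per-benchmark ratios,
// so the suite-level slowdown is not the arithmetic mean of the
// individual slowdowns.
#include <cmath>
#include <vector>

static double
geomean (const std::vector<double> &ratios)
{
  double log_sum = 0.0;
  for (double r : ratios)
    log_sum += std::log (r);
  return std::exp (log_sum / ratios.size ());
}

// Suite slowdown:            geomean (base) / geomean (patched) - 1
// Mean of the slowdowns:     average over i of (base[i] / patched[i] - 1)
// These generally differ.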

So another metric to look at would be to look at the Slowdown of your
estimated (probably non-reportable) SPEC scores as well to get a more
"spec like" metric.

regards
Ramana
>
> --
> Kees Cook


Re: [PATCH, rs6000] Fix Vector long long subtype (PR96139)

2020-09-03 Thread Segher Boessenkool
Hi!

On Thu, Sep 03, 2020 at 10:37:33AM -0500, will schmidt wrote:
> On Wed, 2020-09-02 at 05:13 -0500, Segher Boessenkool wrote:
> > On Tue, Sep 01, 2020 at 09:00:20PM -0500, will schmidt wrote:
> > >   This corrects an issue with the powerpc vector long long
> > > subtypes.
> > > As reported by SjMunroe in PR96139.  When building some code with
> > > -Wall
> > > and attempting to print an element of a "long long vector" with a
> > > long long
> > > printf format string, we will report a error because the vector
> > > sub-type
> > > was improperly defined as int.
> > > 
> > > When defining a V2DI_type_node we use a TARGET_POWERPC64 ternary to
> > > define the V2DI_type_node with "vector long" or "vector long long".
> > > We also need to specify the proper sub-type when we define the
> > > type.
> > > -  V2DI_type_node = rs6000_vector_type (TARGET_POWERPC64 ? "__vector long"
> > > -                                       : "__vector long long",
> > > -                                       intDI_type_node, 2);
> > > +  V2DI_type_node
> > > +    = rs6000_vector_type (TARGET_POWERPC64
> > > +                          ? "__vector long" : "__vector long long",
> > > +                          TARGET_POWERPC64
> > > +                          ? long_long_integer_type_node : intDI_type_node,
> > > +                          2);
> > 
> > Can't you just use long_long_integer_type_node in all cases?  Or,
> > what
> > else is intDI_type_node for 32 bit?
> 
> I'm not sure offhand.  I'm tending to assume the use of intDI_type_node
> is critical for some underlying reason.I'll give this a spin with
> just long_long_integer_type_node and see what happens.  

If that works, that is okay for trunk (and all backports you want).  If
not, just use what you sent.

Thanks!


Segher


Re: [PATCH] c++: Disable -frounding-math during manifestly constant evaluation [PR96862]

2020-09-03 Thread Jason Merrill via Gcc-patches

On 9/2/20 6:43 PM, Marc Glisse wrote:

On Wed, 2 Sep 2020, Jason Merrill via Gcc-patches wrote:


On 9/1/20 6:13 AM, Marc Glisse wrote:

On Tue, 1 Sep 2020, Jakub Jelinek via Gcc-patches wrote:


As discussed in the PR, fold-const.c punts on floating point constant
evaluation if the result is inexact and -frounding-math is turned on.
      /* Don't constant fold this floating point operation if the
         result may dependent upon the run-time rounding mode and
         flag_rounding_math is set, or if GCC's software emulation
         is unable to accurately represent the result.  */

      if ((flag_rounding_math
           || (MODE_COMPOSITE_P (mode) && !flag_unsafe_math_optimizations))
          && (inexact || !real_identical (&result, &value)))
        return NULL_TREE;

Jonathan said that we should be evaluating them anyway, e.g. conceptually
as if they are done with the default rounding mode before user had a chance
to change that, and e.g. in C in initializers it is also ignored.
In fact, fold-const.c for C initializers turns off various other options:


/* Perform constant folding and related simplification of initializer
   expression EXPR.  These behave identically to "fold_buildN" but ignore
   potential run-time traps and exceptions that fold must preserve.  */


#define START_FOLD_INIT \
 int saved_signaling_nans = flag_signaling_nans;\
 int saved_trapping_math = flag_trapping_math;\
 int saved_rounding_math = flag_rounding_math;\
 int saved_trapv = flag_trapv;\
 int saved_folding_initializer = folding_initializer;\
 flag_signaling_nans = 0;\
 flag_trapping_math = 0;\
 flag_rounding_math = 0;\
 flag_trapv = 0;\
 folding_initializer = 1;

#define END_FOLD_INIT \
 flag_signaling_nans = saved_signaling_nans;\
 flag_trapping_math = saved_trapping_math;\
 flag_rounding_math = saved_rounding_math;\
 flag_trapv = saved_trapv;\
 folding_initializer = saved_folding_initializer;

So, shall cxx_eval_outermost_constant_expr instead turn off all those
options (then warning_sentinel wouldn't be the right thing to use, but given
the 8 or how many return stmts in cxx_eval_outermost_constant_expr, we'd
need a RAII class for this.  Not sure about the folding_initializer, that
one is affecting complex multiplication and division constant evaluation
somehow.


I don't think we need to turn off flag_signaling_nans or flag_trapv. 
I think we want to turn off flag_trapping_math so we can fold 1./0 to 
inf (still in a context where folding is mandatory). Setting 
folding_initializer seems consistent with that, enabling infinite 
results in complex folding (it also forces folding of 
__builtin_constant_p, which may be redundant with 
force_folding_builtin_constant_p).


C++ says that division by zero has undefined behavior, and that an 
expression with undefined behavior is not constant, so we shouldn't 
fold 1./0 to inf anyway.  The same is true of other trapping 
operations.  So clearing flag_signaling_nans, flag_trapping_math, and 
flag_trapv seems wrong for C++. And folding_initializer seems to be 
used for the same sort of thing.


So we should actually force flag_trapping_math to true during constexpr
evaluation? And folding_initializer to false, and never mind trapv but
maybe disable wrapv?

#include <limits>
constexpr double a = std::numeric_limits<double>::infinity();
constexpr double b = a + a;
constexpr double c = a - a;
constexpr double d = 1. / a;
constexpr double e = 1. / d;

clang rejects c and e. MSVC rejects e. Intel warns on c.

Gcc rejects only e, and accepts the whole thing if I pass
-fno-trapping-math.


At the November 2016 C++ meeting the numerics study group (SG6) said 
about core issue 2168, "arithmetic operations (not conversions) that 
produce infinity are not allowed in a constant expression. Just using 
std::numeric_limits::infinity() is ok, but you can't add 0 to it." 
But the issue is still open.


Some later discussion disputes that this is the right choice, but there 
were voices on both sides and the discussion didn't come to a conclusion.

https://lists.isocpp.org/core/2018/02/3974.php

It seems that clang used to reject expressions that produced infinities 
(as per SG6 above), but that was changed last year to treat the range of 
floating point types as including infinity, so that a floating point 
operation is never considered overflowing.

https://reviews.llvm.org/D63793

C11 says, "The minimum range of representable values for a floating type 
is the most negative finite floating-p

Re: [PATCH v3] c++: Fix P0960 in member init list and array [PR92812]

2020-09-03 Thread Jason Merrill via Gcc-patches

On 9/2/20 6:08 PM, Marek Polacek wrote:

On Wed, Sep 02, 2020 at 05:06:45PM -0400, Jason Merrill wrote:

On 9/2/20 4:37 PM, Marek Polacek wrote:

I've added do_aggregate_paren_init to factor some common code.  It's not
perfect because the enclosing conditions couldn't really be factored out,


This condition:


+  && (list_length (init) > 1
+  /* A single-element list: handle non-standard extensions
+ like compound literals.  This also prevents triggering
+ aggregate ()-initialization in compiler-generated code
+ for =default.  */
+  || (list_length (init) == 1
+  && !same_type_ignoring_top_level_qualifiers_p
+  (type, TREE_TYPE (TREE_VALUE (init))


seems like it could move into do_aggregate_paren_init as well, even if it's
redundant with code in the caller in some cases; we never want to add { } to
a single parenthesized expression of the same type.


True, done in this patch.


And don't we need to
check for the same-type situation for the array case in check_initializer?


Yea, I think so.  It can be reached using either

   char c[4]("foo");
  
which is already handled by the string literal case, or with an array prvalue:


   using T = int[2];
   T t(T{1, 1});


Incidentally, for checking whether a list is length 1 or more, probably
slightly more efficient to look at TREE_CHAIN directly like
do_aggregate_paren_init does in this patch, rather than use list_length.


Done.

How does this look?  Testing in progress, but dg.exp and old-deja.exp
is clean.

Ok if testing passes?


OK, thanks.


-- >8 --
This patch nails down the remaining P0960 case in PR92812:

   struct A {
 int ar[2];
 A(): ar(1, 2) {} // doesn't work without this patch
   };

Note that when the target object is not of array type, this already
works:

   struct S { int x, y; };
   struct A {
 S s;
 A(): s(1, 2) { } // OK in C++20
   };

because build_new_method_call_1 takes care of the P0960 magic.

It proved to be quite hairy.  When the ()-list has more than one
element, we can always create a CONSTRUCTOR, because the code was
previously invalid.  But when the ()-list has just one element, it
gets all kinds of difficult.  As usual, we have to handle a("foo")
so as not to wrap the STRING_CST in a CONSTRUCTOR.  Always turning
x(e) into x{e} would run into trouble as in c++/93790.  Another
issue was what to do about x({e}): previously, this would trigger
"list-initializer for non-class type must not be parenthesized".
I figured I'd make this work in C++20, so that given

   struct S { int x, y; };

you can do

S a[2];
[...]
A(): a({1, 2}) // initialize a[0] with {1, 2} and a[1] with {}

It also turned out that, as an extension, we support compound literals:

   F (): m((S[1]) { 1, 2 })

so this has to keep working as before.  Moreover, make sure not to trigger
in compiler-generated code, like =default, where array assignment is allowed.

I've factored out a function that turns a TREE_LIST into a CONSTRUCTOR
to simplify handling of P0960.

paren-init35.C also tests this with vector types.

gcc/cp/ChangeLog:

PR c++/92812
* cp-tree.h (do_aggregate_paren_init): Declare.
* decl.c (do_aggregate_paren_init): New.
(grok_reference_init): Use it.
(check_initializer): Likewise.
* init.c (perform_member_init): Handle initializing an array from
a ()-list.  Use do_aggregate_paren_init.

gcc/testsuite/ChangeLog:

PR c++/92812
* g++.dg/cpp0x/constexpr-array23.C: Adjust dg-error.
* g++.dg/cpp0x/initlist69.C: Likewise.
* g++.dg/diagnostic/mem-init1.C: Likewise.
* g++.dg/init/array28.C: Likewise.
* g++.dg/cpp2a/paren-init33.C: New test.
* g++.dg/cpp2a/paren-init34.C: New test.
* g++.dg/cpp2a/paren-init35.C: New test.
* g++.old-deja/g++.brendan/crash60.C: Adjust dg-error.
* g++.old-deja/g++.law/init10.C: Likewise.
* g++.old-deja/g++.other/array3.C: Likewise.
---
  gcc/cp/cp-tree.h  |   1 +
  gcc/cp/decl.c |  62 +
  gcc/cp/init.c |  26 +++-
  .../g++.dg/cpp0x/constexpr-array23.C  |   6 +-
  gcc/testsuite/g++.dg/cpp0x/initlist69.C   |   4 +-
  gcc/testsuite/g++.dg/cpp2a/paren-init33.C | 128 ++
  gcc/testsuite/g++.dg/cpp2a/paren-init34.C |  25 
  gcc/testsuite/g++.dg/cpp2a/paren-init35.C |  21 +++
  gcc/testsuite/g++.dg/diagnostic/mem-init1.C   |   4 +-
  gcc/testsuite/g++.dg/init/array28.C   |   2 +-
  .../g++.old-deja/g++.brendan/crash60.C|   2 +-
  gcc/testsuite/g++.old-deja/g++.law/init10.C   |   2 +-
  gcc/testsuite/g++.old-deja/g++.other/array3.C |   3 +-
  13 files changed, 239 insertions(+), 47 deletions(-)
  create mode 100644 gcc/testsuite/g++.dg/cpp2a/paren-init33.C
  creat

Re: [PATCH] c++: Fix another PCH hash_map issue [PR96901]

2020-09-03 Thread Jason Merrill via Gcc-patches

On 9/3/20 4:20 AM, Jakub Jelinek wrote:

Hi!

The recent libstdc++ changes caused lots of libstdc++-v3 tests FAILs
on i686-linux, all of them in the same spot during constexpr evaluation
of a recursive _S_gcd call.
The problem is yet another hash_map that used the default hasing of
tree keys through pointer hashing which is preserved across PCH write/read.
During PCH handling, the addresses of GC objects are changed, which means
that the hash values of the keys in such hash tables change without those
hash tables being rehashed.  Which in the fundef_copies_table case usually
means we just don't find a copy of a FUNCTION_DECL body for recursive uses
and start from scratch.  But when the hash table keeps growing, the "dead"
elements in the hash table can sometimes reappear and break things.
In particular what I saw under the debugger is when the fundef_copies_table
hash map has been used on the outer _S_gcd call, it didn't find an entry for
it, so returned a slot with *slot == NULL, which is treated as that the
function itself is used directly (i.e. no recursion), but that addition of
a hash table slot caused the recursive _S_gcd call to actually find
something in the hash table, unfortunately not the new *slot == NULL spot,
but a different one from the pre-PCH streaming which contained the returned
toplevel (non-recursive) call entry for it, which means that for the
recursive _S_gcd call we actually used the same trees as for the outer ones
rather than a copy of those, which breaks constexpr evaluation.

Fixed thusly, bootstrapped/regtested on x86_64-linux and i686-linux,
ok for trunk (and given that it is latent eventually for release branches
too)?


OK.


2020-09-03  Jakub Jelinek  

PR c++/96901
* tree.h (struct decl_tree_traits): New type.
(decl_tree_map): New typedef.

* constexpr.c (fundef_copies_table): Change type from
hash_map * to decl_tree_map *.

--- gcc/tree.h.jj   2020-08-31 22:50:28.545593391 +0200
+++ gcc/tree.h  2020-09-02 17:44:45.478358927 +0200
@@ -5453,6 +5453,11 @@ struct type_tree_cache_traits
: simple_cache_map_traits { };
  typedef hash_map type_tree_cache_map;
  
+/* Similarly to decl_tree_cache_map, but without caching.  */

+struct decl_tree_traits
+  : simple_hashmap_traits { };
+typedef hash_map decl_tree_map;
+
  /* Initialize the abstract argument list iterator object ITER with the
 arguments from CALL_EXPR node EXP.  */
  static inline void
--- gcc/cp/constexpr.c.jj   2020-09-01 22:47:49.946383670 +0200
+++ gcc/cp/constexpr.c  2020-09-02 17:46:35.267758742 +0200
@@ -1203,7 +1203,7 @@ maybe_initialize_constexpr_call_table (v
  
 This is not GC-deletable to avoid GC affecting UID generation.  */
  
-static GTY(()) hash_map *fundef_copies_table;

+static GTY(()) decl_tree_map *fundef_copies_table;
  
  /* Reuse a copy or create a new unshared copy of the function FUN.

 Return this copy.  We use a TREE_LIST whose PURPOSE is body, VALUE

Jakub





Re: [RFC] enable flags-unchanging asms, add_overflow/expand/combine woes

2020-09-03 Thread Segher Boessenkool
On Thu, Sep 03, 2020 at 02:07:24PM -0300, Alexandre Oliva wrote:
> On Sep  3, 2020, Segher Boessenkool  wrote:
> > On Thu, Sep 03, 2020 at 07:03:53AM -0300, Alexandre Oliva wrote:
> >> On Sep  2, 2020, Segher Boessenkool  wrote:
> >> >> we might succeed, but only if we had a pattern
> >> >> that matched add3_cc_overflow_1's parallel with the flag-setter as
> >> >> the second element of the parallel, because that's where combine adds it
> >> >> to the new i3 pattern, after splitting it out of i2.
> >> 
> >> > That sounds like the backend pattern has it wrong then?  There is a
> >> > canonical order for this?
> >> 
> >> Much as I can tell, there isn't, it's just an arbitrary choice of
> >> backends, some do it one way or the other, and that causes combine to be
> >> able to perform some combinations but not others.
> 
> > For instructions that inherently set a condition code register, the
> > @code{compare} operator is always written as the first RTL expression of
> > the @code{parallel} instruction pattern.
> 
> Interesting.  I'm pretty sure I read email recently that suggested it
> was really up to the port, but I've caught up with GCC emails from years
> ago, so that might have been it.  Or I just misremember.  Whatever.

I think you remember right.  But combine depends on the documented
order, and so does compare-elim (since 4f0473fe89e6), so now the
documented order is always the only wanted one.

> The x86 pattern that fails to match in combine has the flags setter
> first, but combine places it second, after splitting it out of i2 and
> then appending it back to i3.

What does that RTL look like exactly?  This canonical form is only for
a set of the flags as a compare to 0 of what the other set sets (hrm, I
hope you can make sense of that).

> Alas, it would be just as legitimate for combine to go the opposite way,
> substituting the flags set into another insn, and then tacking the other
> set onto the substituted-into insn.

Combine always generates the canonical form (for this, anyway; and it is
a missed optimisation bug if it makes something non-canonical anywhere).

Do you have a simple testcase?  Or a -fdump-rtl-combine-all dump.


Segher


[PATCH, committed] PR fortran/96890 - Wrong answer with intrinsic IALL

2020-09-03 Thread Harald Anlauf
Committed as obvious after regtesting.

The m4 template for the IALL library functions erroneously used 0
instead of -1 as the initial value for the case when the DIM and MASK
arguments were present.  This was unfortunately not tested in the
testsuite before.  Fixed and testcase added.
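
The underlying identity is easy to see in a small C sketch (illustrative
only, the library code itself is generated from the m4 template):

/* The identity element of a bitwise AND reduction is all-ones (-1);
   starting from 0 instead forces every masked result to 0.  */
static int
iand_reduce (const int *a, int n)
{
  int result = -1;
  for (int i = 0; i < n; i++)
    result &= a[i];
  return result;
}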

Thanks,
Harald


PR fortran/96890 - Wrong answer with intrinsic IALL

The IALL intrinsic would always return 0 when the DIM and MASK arguments
were present since the initial value of repeated BIT-AND operations was
set to 0 instead of -1.

libgfortran/ChangeLog:

* m4/iall.m4: Initial value for result should be -1.
* generated/iall_i1.c (miall_i1): Generated.
* generated/iall_i16.c (miall_i16): Likewise.
* generated/iall_i2.c (miall_i2): Likewise.
* generated/iall_i4.c (miall_i4): Likewise.
* generated/iall_i8.c (miall_i8): Likewise.

gcc/testsuite/ChangeLog:

* gfortran.dg/iall_masked.f90: New test.

diff --git a/gcc/testsuite/gfortran.dg/iall_masked.f90 b/gcc/testsuite/gfortran.dg/iall_masked.f90
new file mode 100644
index 000..33cc4106a1b
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/iall_masked.f90
@@ -0,0 +1,22 @@
+! { dg-do run }
+! PR fortran/96890 - Wrong answer with intrinsic IALL
+program p
+  implicit none
+  integer :: iarr1(0), iarr2(2,2), iarr3(2,2,2)
+  logical :: mask1(0), mask2(2,2), mask3(2,2,2)
+
+  if ( iall(iarr1,mask1) /=  -1 ) stop 1
+  if ( iall(iarr1, 1, mask1) /=  -1 ) stop 2
+
+  iarr2 = reshape ([  1,  2,   3,  4  ], shape (iarr2))
+  mask2 = reshape ([ .true., .false., .true., .false. ], shape (mask2))
+
+  if (any (iall(iarr2, 2, mask2) /=  [1,-1])) stop 3
+
+  iarr3 = reshape ([  1,  2,   3,  4, &
+  5,  6,   7,  8  ], shape (iarr3))
+  mask3 = reshape ([ .true., .false., .true., .false.,&
+ .true., .false., .true., .false. ], shape (iarr3))
+
+  if (any (iall(iarr3, 2, mask3) /= reshape ([1,-1,5,-1],[2,2]))) stop 4
+end
diff --git a/libgfortran/generated/iall_i1.c b/libgfortran/generated/iall_i1.c
index 3fe0a1698ad..086a5464aad 100644
--- a/libgfortran/generated/iall_i1.c
+++ b/libgfortran/generated/iall_i1.c
@@ -345,7 +345,7 @@ miall_i1 (gfc_array_i1 * const restrict retarray,
   msrc = mbase;
   {

-  result = 0;
+  result = (GFC_INTEGER_1) -1;
 	for (n = 0; n < len; n++, src += delta, msrc += mdelta)
 	  {

diff --git a/libgfortran/generated/iall_i16.c b/libgfortran/generated/iall_i16.c
index 35d9872c0e9..c491414ca7c 100644
--- a/libgfortran/generated/iall_i16.c
+++ b/libgfortran/generated/iall_i16.c
@@ -345,7 +345,7 @@ miall_i16 (gfc_array_i16 * const restrict retarray,
   msrc = mbase;
   {

-  result = 0;
+  result = (GFC_INTEGER_16) -1;
 	for (n = 0; n < len; n++, src += delta, msrc += mdelta)
 	  {

diff --git a/libgfortran/generated/iall_i2.c b/libgfortran/generated/iall_i2.c
index ef90119341f..d43e5df1809 100644
--- a/libgfortran/generated/iall_i2.c
+++ b/libgfortran/generated/iall_i2.c
@@ -345,7 +345,7 @@ miall_i2 (gfc_array_i2 * const restrict retarray,
   msrc = mbase;
   {

-  result = 0;
+  result = (GFC_INTEGER_2) -1;
 	for (n = 0; n < len; n++, src += delta, msrc += mdelta)
 	  {

diff --git a/libgfortran/generated/iall_i4.c b/libgfortran/generated/iall_i4.c
index 27140abeaa8..039e7963798 100644
--- a/libgfortran/generated/iall_i4.c
+++ b/libgfortran/generated/iall_i4.c
@@ -345,7 +345,7 @@ miall_i4 (gfc_array_i4 * const restrict retarray,
   msrc = mbase;
   {

-  result = 0;
+  result = (GFC_INTEGER_4) -1;
 	for (n = 0; n < len; n++, src += delta, msrc += mdelta)
 	  {

diff --git a/libgfortran/generated/iall_i8.c b/libgfortran/generated/iall_i8.c
index 6047169c62e..d01f7aecaf8 100644
--- a/libgfortran/generated/iall_i8.c
+++ b/libgfortran/generated/iall_i8.c
@@ -345,7 +345,7 @@ miall_i8 (gfc_array_i8 * const restrict retarray,
   msrc = mbase;
   {

-  result = 0;
+  result = (GFC_INTEGER_8) -1;
 	for (n = 0; n < len; n++, src += delta, msrc += mdelta)
 	  {

diff --git a/libgfortran/m4/iall.m4 b/libgfortran/m4/iall.m4
index df57367c100..8f3b7741486 100644
--- a/libgfortran/m4/iall.m4
+++ b/libgfortran/m4/iall.m4
@@ -35,7 +35,7 @@ ARRAY_FUNCTION(0,
 `  result &= *src;')

 MASKED_ARRAY_FUNCTION(0,
-`  result = 0;',
+`  result = ('rtype_name`) -1;',
 `  if (*msrc)
 result &= *src;')



*Ping*: [PATCH] PR fortran/96711 - ICE on NINT() Function

2020-09-03 Thread Harald Anlauf
*ping*

> Gesendet: Montag, 24. August 2020 um 23:08 Uhr
> Von: "Harald Anlauf" 
> An: "fortran" , "gcc-patches" 
> Betreff: [PATCH] PR fortran/96711 - ICE on NINT() Function
>
> When rounding a real to integer(16) an ICE happened due to an unhandled
> case in build_round_expr.  The attached patch adds a special case for this.
>
> I had to change a fold_convert to a convert (which seems to be frontend
> specific stuff), otherwise I would get errors from the generated GIMPLE.
> Does anybody know if this could produce problems elsewhere?
> I'm open to other / better solutions.
>
> Anyway, the patch does regtest cleanly on x86_64-pc-linux-gnu.
>
> OK for master?
>
> Thanks,
> Harald
>
>
> PR fortran/96711 - ICE with NINT() for integer(16) result
>
> When rounding a real to the nearest integer, temporarily convert the real
> argument to a longer real kind when the result is of type/kind integer(16).
>
> gcc/fortran/ChangeLog:
>
>   * trans-intrinsic.c (build_round_expr): Use temporary with
>   appropriate kind for conversion before rounding to nearest
>   integer when the result precision is 128 bits.
>
> gcc/testsuite/ChangeLog:
>
>   * gfortran.dg/pr96711.f90: New test.
>
>


Re: [PATCH] correct handling of indices into arrays with elements larger than 1 (PR c++/96511)

2020-09-03 Thread Martin Sebor via Gcc-patches

On 9/1/20 1:22 PM, Jason Merrill wrote:

On 8/11/20 12:19 PM, Martin Sebor via Gcc-patches wrote:

-Wplacement-new handles array indices and pointer offsets the same:
by adjusting them by the size of the element.  That's correct for
the latter but wrong for the former, causing false positives when
the element size is greater than one.

In addition, the warning doesn't even attempt to handle arrays of
arrays.  I'm not sure if I forgot or if I simply didn't think of
it.

The attached patch corrects these oversights by replacing most
of the -Wplacement-new code with a call to compute_objsize which
handles all this correctly (plus more), and is also better tested.
But even compute_objsize has bugs: it trips up while converting
wide_int to offset_int for some pointer offset ranges.  Since
handling the C++ IL required changes in this area the patch also
fixes that.

For review purposes, the patch affects just the middle end.
The C++ diff pretty much just removes code from the front end.


The C++ changes are OK.


Thank you for looking at the rest as well.




-compute_objsize (tree ptr, int ostype, access_ref *pref,
-    bitmap *visited, const vr_values *rvals /* = NULL */)
+compute_objsize (tree ptr, int ostype, access_ref *pref, bitmap *visited,
+    const vr_values *rvals)


This reformatting seems unnecessary, and I prefer to keep the comment 
about the default argument.


This overload doesn't take a default argument.  (There was a stray
declaration of a similar function at the top of the file that had
one.  I've removed it.)




-  if (!size || TREE_CODE (size) != INTEGER_CST)
-   return false;

 >...

You change some failure cases in compute_objsize to return success with 
a maximum range, while others continue to return failure.  This needs 
commentary about the design rationale.


This is too much for a comment in the code but the background is
this: compute_objsize initially returned the object size as a constant.
Recently, I have enhanced it to return a range to improve warnings for
allocated objects.  With that, a failure can be turned into success by
having the function set the range to that of the largest object.  That
should simplify the function's callers and could even improve
the detection of some invalid accesses.  Once this change is made
it might even be possible to change its return type to void.

The change that caught your eye is necessary to make the function
a drop-in replacement for the C++ front end code which makes this
same assumption.  Without it, a number of test cases that exercise
VLAs fail in g++.dg/warn/Wplacement-new-size-5.C.  For example:

  void f (int n)
  {
char a[n];
new (a - 1) int ();
  }

Changing any of the other places isn't necessary for existing tests
to pass (and I didn't want to introduce too much churn).  But I do
want to change the rest of the function along the same lines at some
point.

I've tweaked the comment a little bit without going into all this
detail.




+  special_array_member sam{ };


sam is always set by component_ref_size, so I don't think it's necessary 
to initialize it at the declaration.


I find initializing pass-by-pointer local variables helpful but
I don't insist on it.




@@ -187,7 +187,7 @@ decl_init_size (tree decl, bool min)
   tree last_type = TREE_TYPE (last);
   if (TREE_CODE (last_type) != ARRAY_TYPE
   || TYPE_SIZE (last_type))
-    return size;
+    return size ? size : TYPE_SIZE_UNIT (type);


This change seems to violate the comment for the function.


By my reading (and writing) the change is covered by the first
sentence:

   Returns the size of the object designated by DECL considering
   its initializer if it either has one or if it would not affect
   its size, ...

Without it, a number of cases in Wplacement-new-size.C fail that
construct a larger object in an extern declaration of a template,
like this:

  template  struct S { char c; };
  extern S s;

  void f ()
  {
new (&s) int ();
  }

I don't know why DECL_SIZE isn't set here (I don't think it can
be anything but equal to TYPE_SIZE, can it?) and other than struct
objects with a flexible array member where this identity doesn't
hold I can't think of others.  Am I missing something?

Attached is an updated patch rebased on top of the current trunk
and retested on x86_64-linux + by building Glibc.

Martin
Correct handling of indices into arrays with elements larger than 1 (PR c++/96511)

Resolves:
PR c++/96511 - Incorrect -Wplacement-new on POINTER_PLUS into an array with 4-byte elements
PR middle-end/96384 - bogus -Wstringop-overflow= storing into multidimensional array with index in range

gcc/ChangeLog:

	PR c++/96511
	PR middle-end/96384
	* builtins.c (get_range): Return full range of type when neither
	value nor its range is available.  Fail for ranges inverted due
	to the signedness of offsets.
	(compute_objsize): Handle more special array members.  Handle
	POINTER_PLUS_EXPR and VIEW_CONVERT_EXPR that come u

Re: [RFC] enable flags-unchanging asms, add_overflow/expand/combine woes

2020-09-03 Thread Segher Boessenkool
On Thu, Sep 03, 2020 at 02:17:04PM -0300, Alexandre Oliva wrote:
> On Sep  3, 2020, Segher Boessenkool  wrote:
> > Oh, I meant what was in your patch.  From a grep, all of mn10300, nds32,
> > arm, cris, pdp11, rs6000, i386, visium, aarch64 have default clobbers.
> > Hrm, what you said was the targets with inline asm flag output
> > constraints?
> 
> I found implicit cc clobbers on x86 only, and then I stumbled across the
> same notation used for asm cc-setting in arm/aarch64, but that was all,
> I didn't look any further (yet).

mn10300, cris, pdp11 (twice), i386, visium at least.

> I'll look at the other targets with
> asm_adjust hooks and see whether they need adjustments for "none", if we
> agree that's the way to go.

The idea is that none of them will need adjustment.  This hook runs
before the "none" code will run, and it can just clear all clobbers
then.

> > Before your change, flags could not be live through an asm.
> 
> While implementing this patchlet, I learned that standard asms (as
> opposed to GCC-extended ones) only undergo machine-dependent adjustments
> if their asm pattern string is non-empty.  So there is precedent.  I may
> even have a finger on that precedent, from very long ago.

Exceptions can look reasonable and easy, until you have to document it,
and then write down how it interacts with other stuff :-/

> > After your change, it can.  So it does change things...
> 
> No doubt it does change things, that's not the point.  The point is how
> could it break things.  If the compiler can see flags are set before the
> asm, and it can see the flags could be used after the asm, and the asm
> has no code whatsoever in it that could possibly clobber the flags...
> What could possibly go wrong?  (Famous last words? :-)

The user could want to use an empty asm to influence what code is
generated.  Yes, I know this isn't a very strong argument :-)

> >> > But how about a "none" clobber?
> >> 
> >> I like that!
> 
> > Okay, sold!  Now to convince everyone else ;-)
> 
> Here's a prototype implementation, x86 only.  (ARM and Thumb do not add
> any clobbers by default, so it's already good for "none")  I haven't yet
> looked at other targets.  It's missing documentation and tests, too.

I think we can just do this in expand_asm_stmt?  It makes a first pass
over the clobbers as well, before actually expanding things, so we could
just set a flag and not call md_asm_adjust?  Maybe "nodefault" is a
better name than "none" btw (must be close to getting it done, if we are
picking colours for the bikeshed :-) )

Or, hrm, doing it exactly like that will disable all =@xxx as well.
That doesn't matter for x86, and probably nowhere in practice, but eww.

Kill all clobbers immediately after calling md_asm_adjust?


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-03 Thread Qing Zhao via Gcc-patches



> On Sep 3, 2020, at 12:48 PM, Ramana Radhakrishnan  
> wrote:
> 
> On Thu, Sep 3, 2020 at 6:13 PM Kees Cook via Gcc-patches
>  wrote:
>> 
>> On Thu, Sep 03, 2020 at 09:29:54AM -0500, Qing Zhao wrote:
>>> On average, all the options starting with “used_…”  (i.e, only the 
>>> registers that are used in the routine will be zeroed) have very low 
>>> runtime overheads, at most 1.72% for integer benchmarks, and 1.17% for FP 
>>> benchmarks.
>>> If all the registers will be zeroed, the runtime overhead is bigger, 
>>> all_arg is 5.7%, all_gpr is 3.5%, and all is 17.56% for integer benchmarks 
>>> on average.
>>> Looks like the overhead of zeroing vector registers is much bigger.
>>> 
>>> For ROP mitigation, -fzero-call-used-regs=used-gpr-arg should be enough, 
>>> the runtime overhead with this is very small.
>> 
>> That looks great; thanks for doing those tests!
>> 
>> (And it seems like these benchmarks are kind of a "worst case" scenario
>> with regard to performance, yes? As in it's mostly tight call loops?)
> 
> 
> That's true of some of them but definitely not all - the GCC benchmark
> springs to mind in SPEC as having quite a flat profile, so I'd take a
> look there and probe a bit more in that one to see what happens. Don't
> ask me what else , that's all I have in my cache this evening :)
> 
> I'd also query the "average" slowdown metric in those numbers as
> something that's being measured in a different way here. IIRC the SPEC
> scores for int and FP are computed with a geometric mean of the
> individual ratios of each of the benchmark. Thus I don't think the
> average of the slowdowns is enough to talk about slowdowns for the
> benchmark suite. A quick calculation of the arithmetic mean of column
> B in my head suggests that it's the arithmetic mean of all the
> slowdowns ?
> 
> i.e. Slowdown (Geometric Mean (x, y, z, ...))  != Arithmetic mean (
> Slowdown (x), Slowdown (y), ...)
> 
> So another metric to look at would be to look at the Slowdown of your
> estimated (probably non-reportable) SPEC scores as well to get a more
> "spec like" metric.

Please take a look at the new csv file at:

https://gitlab.com/x86-gcc/gcc/-/wikis/Zero-call-used-registers-data 


I just uploaded the slowdown data computed based on 
Est.SPECrate(R)2017_int_base and Est.SPECrate(R)2017_fp_base. All data are 
computed against “no”. 

Comparing this slowdown data to the one I computed previously as Arithmetic mean
(Slowdown(x), slowdown(y), ...), the numbers do change a little bit; however, the
basic information provided by the data remains the same as before.

Let me know if you have further comments.

thanks.

Qing


> 
> regards
> Ramana
>> 
>> --
>> Kees Cook



Re: [RFC] enable flags-unchanging asms, add_overflow/expand/combine woes

2020-09-03 Thread Alexandre Oliva
On Sep  3, 2020, Segher Boessenkool  wrote:

> On Thu, Sep 03, 2020 at 02:07:24PM -0300, Alexandre Oliva wrote:
>> On Sep  3, 2020, Segher Boessenkool  wrote:
>> > On Thu, Sep 03, 2020 at 07:03:53AM -0300, Alexandre Oliva wrote:
>> >> On Sep  2, 2020, Segher Boessenkool  wrote:
>> >> >> we might succeed, but only if we had a pattern
>> >> >> that matched add3_cc_overflow_1's parallel with the flag-setter 
>> >> >> as
>> >> >> the second element of the parallel, because that's where combine adds 
>> >> >> it
>> >> >> to the new i3 pattern, after splitting it out of i2.
>> >> 
>> >> > That sounds like the backend pattern has it wrong then?  There is a
>> >> > canonical order for this?
>> >> 
>> >> Much as I can tell, there isn't, it's just an arbitrary choice of
>> >> backends, some do it one way or the other, and that causes combine to be
>> >> able to perform some combinations but not others.
>> 
>> > For instructions that inherently set a condition code register, the
>> > @code{compare} operator is always written as the first RTL expression of
>> > the @code{parallel} instruction pattern.
>> 
>> Interesting.  I'm pretty sure I read email recently that suggested it
>> was really up to the port, but I've caught up with GCC emails from years
>> ago, so that might have been it.  Or I just misremember.  Whatever.

> I think you remember right.  But combine depends on the documented
> order

Except when it doesn't ;-)

Under:

  /* If the actions of the earlier insns must be kept
 in addition to substituting them into the latest one,
 we must make a new PARALLEL for the latest insn
 to hold additional the SETs.  */

it turns the RMW add pattern into a PARALLEL and adds to it i0's
flags-setter, which had been split out of i2.  The PARALLEL has the
flag-setter as the second element, as you can see below, and nothing
(other than my patch) attempts to change that order.

> What does that RTL look like exactly?

Trying 5, 7 -> 15:
5: r91:SI=zero_extend([`i'])
7: {flags:CCC=cmp(r91:SI#0+0x1,r91:SI#0);r92:QI=r91:SI#0+0x1;}
  REG_DEAD r91:SI
   15: [`i']=r92:QI
  REG_DEAD r92:QI
Successfully matched this instruction:
(parallel [
(set (mem/c:QI (symbol_ref:DI ("i") [flags 0x2]  ) [0 i+0 S1 A8])
(plus:QI (mem/c:QI (symbol_ref:DI ("i") [flags 0x2]  ) [0 i+0 S1 A8])
(const_int 1 [0x1])))
(set (reg:CCC 17 flags)
(compare:CCC (plus:QI (mem/c:QI (symbol_ref:DI ("i") [flags 0x2]  
) [0 i+0 S1 A8])
(const_int 1 [0x1]))
(mem/c:QI (symbol_ref:DI ("i") [flags 0x2]  ) [0 i+0 S1 A8])))
])

The reason it successfully matched in this combine dump is that the
above had the permuting recog for combine that I posted earlier enabled.
Without that patch, it wouldn't be recognized, and ended up rejected.

> Do you have a simple testcase?  Or a -fdump-rtl-combine-all dump.

Not so simple, I'm afraid.  I mean, the testcase is simple, but
reproducing this behavior isn't.  The small testcase I'm using only gets
to that point if:

- we allow flags to pass through extended asms, as in the other patch

- I arrange for combine to give the above set of insns a second shot,
  after the removal of the intervening flag-store insn, that prevents
  the combination in the first try

typedef unsigned char T;
T i;
int inc ()
{
  T c = __builtin_add_overflow (i, 1, &i);
  asm ("" : "+m" (i) : : "none");
  i += c;
}


-- 
Alexandre Oliva, happy hacker
https://FSFLA.org/blogs/lxo/
Free Software Activist
GNU Toolchain Engineer


Re: [RFC] enable flags-unchanging asms, add_overflow/expand/combine woes

2020-09-03 Thread Alexandre Oliva
On Sep  3, 2020, Segher Boessenkool  wrote:

> On Thu, Sep 03, 2020 at 02:17:04PM -0300, Alexandre Oliva wrote:
>> On Sep  3, 2020, Segher Boessenkool  wrote:
>> > Oh, I meant what was in your patch.  From a grep, all of mn10300, nds32,
>> > arm, cris, pdp11, rs6000, i386, visium, aarch64 have default clobbers.
>> > Hrm, what you said was the targets with inline asm flag output
>> > constraints?
>> 
>> I found implicit cc clobbers on x86 only, and then I stumbled across the
>> same notation used for asm cc-setting in arm/aarch64, but that was all,
>> I didn't look any further (yet).

> mn10300, cris, pdp11 (twice), i386, visium at least.

Thanks, yeah, I was hoping to get the general approach sorted out before
going through all ports.

> The idea is that none of them will need adjustment.  This hook runs
> before the "none" code will run, and it can just clear all clobbers
> then.

Uhh...  That's not how asm expansion works.  We process inputs, outputs,
clobbers and labels first, *then* we pass all the cooked ops to the
target hook, so that it gets a chance to, well, adjust stuff.

I don't think clobber "none" should disable target adjustments; IMHO, it
should be recognized by target adjustments, and cause them to not add
stuff they normally would.

> The user could want to use an empty asm to influence what code is
> generated.  Yes, I know this isn't a very strong argument :-)

Well, it is to me, it's exactly what I'm doing here, you know? :-)
But it's not the kind of "things going wrong" that I was looking for.

> just set a flag and not call md_asm_adjust?

I don't like that.  When we move some asm processing between
target-specific and generic, in either direction, suddenly "none"
affects whether or not the processing gets done.  I don't like that.

> Maybe "nodefault" is a
> better name than "none" btw

For that, I agree, but I don't think that's a good feature to have.  I'd
rather have "none" and keep the ability to introduce target-specific
adjustments à-la "=@cc*".  There's no good reason IMHO to have that hook
disabled by "none" or by "nodefault".

> (must be close to getting it done, if we are picking colours for the
> bikeshed :-) )

:-D

> Kill all clobbers immediately after calling md_asm_adjust?

Why?  We don't want to discard any clobbers mentioned by the user, and
we (well, I :-) even want to give the target hook a chance to
adjust/remove any of those clobbers.  That requires the target hook to
be aware of the presence of "none", which is fine with me.

-- 
Alexandre Oliva, happy hacker
https://FSFLA.org/blogs/lxo/
Free Software Activist
GNU Toolchain Engineer


Re: ubsan: d-demangle.c:214 signed integer overflow

2020-09-03 Thread Iain Buclaw via Gcc-patches
Excerpts from Alan Modra's message of September 3, 2020 3:01 pm:
> Running the libiberty testsuite
> ./test-demangle < libiberty/testsuite/d-demangle-expected
> libiberty/d-demangle.c:214:14: runtime error: signed integer overflow: 
> 922337203 * 10 cannot be represented in type 'long int'
> 
> On looking at silencing ubsan, I found a real bug in dlang_number.
> For a 32-bit long, some overflows won't be detected.  For example,
> 21474836480.  Why?  Well 214748364 * 10 is 0x7FF8 (no overflow so
> far).  Adding 8 gives 0x8000 (which does overflow but there is no
> test for that overflow in the code).  Then multiplying 0x8000 * 10
> = 0x5 = 0 won't be caught by the multiplication overflow test.
> The same holds for a 64-bit long using similarly crafted digit
> sequences.
> 
> This patch replaces the mod 10 test with a simpler limit test, and
> similarly the mod 26 test in dlang_decode_backref.
> 
> About the limit test:
>   val * 10 + digit > ULONG_MAX is the condition for overflow
> ie.
>   val * 10 > ULONG_MAX - digit
> or
>   val > (ULONG_MAX - digit) / 10
> or assuming the largest digit
>   val > (ULONG_MAX - 9) / 10
> 
> I resisted the aesthetic appeal of simplifying this further to
>   val > -10UL / 10
> since -1UL for ULONG_MAX is only correct for 2's complement numbers.
> 
> Passes all the libiberty tests, on both 32-bit and 64-bit hosts.  OK
> to apply?
> 

Thanks Alan, change seems reasonable, however on giving it a mull over,
I see that the largest number that dlang_number would need to be able to
handle is UINT_MAX.  These two tests which decode a wchar value are
representative of that (first is valid, second invalid).

#
--format=dlang
_D4test21__T3funVwi4294967295Z3funFNaNbNiNfZv
test.fun!('\U').fun()
#
--format=dlang
_D4test21__T3funVwi4294967296Z3funFNaNbNiNfZv
_D4test21__T3funVwi4294967296Z3funFNaNbNiNfZv

I'm fine with creating a new PR and dealing with the above in a separate
change though, as it will require a few more replacements to adjust the
result parameter type to 'unsigned' or 'long long'.  

Iain.


>   * d-demangle.c: Include limits.h.
>   (ULONG_MAX): Provide fall-back definition.
>   (dlang_number): Simplify and correct overflow test.  Only
>   write *ret on returning non-NULL.
>   (dlang_decode_backref): Likewise.
> 
> diff --git a/libiberty/d-demangle.c b/libiberty/d-demangle.c
> index f2d6946eca..59e6ae007a 100644
> --- a/libiberty/d-demangle.c
> +++ b/libiberty/d-demangle.c
> @@ -31,6 +31,9 @@ If not, see .  */
>  #ifdef HAVE_CONFIG_H
>  #include "config.h"
>  #endif
> +#ifdef HAVE_LIMITS_H
> +#include <limits.h>
> +#endif
>  
>  #include "safe-ctype.h"
>  
> @@ -45,6 +48,10 @@ If not, see .  */
>  #include 
>  #include "libiberty.h"
>  
> +#ifndef ULONG_MAX
> +#define  ULONG_MAX   (~0UL)
> +#endif
> +
>  /* A mini string-handling package */
>  
>  typedef struct string/* Beware: these aren't required to be 
> */
> @@ -207,24 +214,24 @@ dlang_number (const char *mangled, long *ret)
>if (mangled == NULL || !ISDIGIT (*mangled))
>  return NULL;
>  
> -  (*ret) = 0;
> +  unsigned long val = 0;
>  
>while (ISDIGIT (*mangled))
>  {
> -  (*ret) *= 10;
> -
> -  /* If an overflow occured when multiplying by ten, the result
> -  will not be a multiple of ten.  */
> -  if ((*ret % 10) != 0)
> +  /* Check for overflow.  Yes, we return NULL here for some digits
> +  that don't overflow "val * 10 + digit", but that doesn't
> +  matter given the later "(long) val < 0" test.  */
> +  if (val > (ULONG_MAX - 9) / 10)
>   return NULL;
>  
> -  (*ret) += mangled[0] - '0';
> +  val = val * 10 + mangled[0] - '0';
>mangled++;
>  }
>  
> -  if (*mangled == '\0' || *ret < 0)
> +  if (*mangled == '\0' || (long) val < 0)
>  return NULL;
>  
> +  *ret = val;
>return mangled;
>  }
>  
> @@ -294,24 +301,24 @@ dlang_decode_backref (const char *mangled, long *ret)
>   [A-Z] NumberBackRef
>   ^
> */
> -  (*ret) = 0;
> +  unsigned long val = 0;
>  
>while (ISALPHA (*mangled))
>  {
> -  (*ret) *= 26;
> +  /* Check for overflow.  */
> +  if (val > (ULONG_MAX - 25) / 26)
> + break;
>  
> -  /* If an overflow occured when multiplying by 26, the result
> -  will not be a multiple of 26.  */
> -  if ((*ret % 26) != 0)
> - return NULL;
> +  val *= 26;
>  
>if (mangled[0] >= 'a' && mangled[0] <= 'z')
>   {
> -   (*ret) += mangled[0] - 'a';
> +   val += mangled[0] - 'a';
> +   *ret = val;
> return mangled + 1;
>   }
>  
> -  (*ret) += mangled[0] - 'A';
> +  val += mangled[0] - 'A';
>mangled++;
>  }
>  
> -- 
> Alan Modra
> Australia Development Lab, IBM
> 


Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling

2020-09-03 Thread Segher Boessenkool
On Thu, Sep 03, 2020 at 10:24:21AM +0800, Kewen.Lin wrote:
> on 2020/9/2 6:25 PM, Segher Boessenkool wrote:
> > On Wed, Sep 02, 2020 at 11:16:00AM +0800, Kewen.Lin wrote:
> >> on 2020/9/1 3:41 AM, Segher Boessenkool wrote:
> >>> On Tue, Aug 25, 2020 at 08:46:55PM +0800, Kewen.Lin wrote:
>  1) Currently address_cost hook on rs6000 always return zero, but at least
>  from Power7, pre_inc/pre_dec kind instructions are cracked, it means we
>  have to take the address update into account (scalar normal operation).
> >>>
> >>> From Power4 on already (not sure about Power6, but does anyone care?)
> >>
> >> Thanks for the information, it looks this issue exists for a long time.
> > 
> > Well, *is* it an issue?  The addressing doesn't get more expensive...
> > For example, an
> >   ldu 3,16(4)
> > is cracked to an
> >   ld 3,16(4)
> > and an
> >   addi 4,4,16
> > (the addi is not on the critical path of the load).  So it seems to me
> > this shouldn't increase the addressing cost at all?  (The instruction of
> > course is really two insns in one.)
> 
> Good question!  I agree that they can execute in parallel, but it depends
> on how we interpret the addressing cost.  If it is meant to model the
> required execution resources, I think zero is off, since compared with ld,
> the ldu has two iops and an extra ALU requirement.

OTOH, if you do not use an ldu you need to use a real addi insn, which
gives you all the same cost (plus it takes more code space and decode etc.
resources).

> I'm not sure about its usage elsewhere, but in the
> context of IVOPTs on Power, for one normal candidate, its step cost is 4,
> the cost for group (1) is zero, total cost is 4 for this combination.
> for the scenario like:
> ldx rx, iv // (1)
> ...
> iv = iv + step // (2)
> 
> While for ainc_use candidate (like ldu), its step cost is 4, but the cost
> for group (1) is (-4 // minus step cost), total cost is 0.  It looks to
> say the step update is free.

That seems wrong, but the address_cost is used in more places, that is
not where to fix this?

> We can also see (1) and (2) can also execute in parallel (same iteration).
> If we consider the next iteration, it will have the dependency, but it's
> the same for ldu.  So basically they are similar, I think it's unfair to
> have this difference in the cost modeling.  The cracked addi should have
> its cost here.  Does it make sense?

It should have cost, certainly, but not address_cost I think.  The total
cost of an ldu should be a tiny bit less than that of ld + that of addi;
the address_cost of ldu should be the same as that of ld.

> Apart from that, one P9 specific point is that the update form load isn't
> preferred.  The reason is that the instruction can not retire until both
> parts complete, so it can hold up subsequent instructions from retiring.
> If the addi stalls (starvation), the instruction can not retire and can
> cause things to get stuck.  It seems also something we can model here?

This is (almost) no problem on p9, since we no longer have issue groups.
It can hold up older insns from retiring, sure, but they *will* have
finished, and p9 can retire 64 insns per cycle.  The "completion wall"
is gone.  The only problem is if things stick around so long that
resources run out...  but you're talking 100s of insns there.


Segher


Re: ubsan: d-demangle.c:214 signed integer overflow

2020-09-03 Thread Alan Modra via Gcc-patches
On Thu, Sep 03, 2020 at 11:02:50PM +0200, Iain Buclaw wrote:
> Excerpts from Alan Modra's message of September 3, 2020 3:01 pm:
> > Running the libiberty testsuite
> > ./test-demangle < libiberty/testsuite/d-demangle-expected
> > libiberty/d-demangle.c:214:14: runtime error: signed integer overflow: 
> > 922337203 * 10 cannot be represented in type 'long int'
> > 
> > On looking at silencing ubsan, I found a real bug in dlang_number.
> > For a 32-bit long, some overflows won't be detected.  For example,
> > 21474836480.  Why?  Well 214748364 * 10 is 0x7FFFFFF8 (no overflow so
> > far).  Adding 8 gives 0x80000000 (which does overflow but there is no
> > test for that overflow in the code).  Then multiplying 0x80000000 * 10
> > = 0x500000000 = 0 won't be caught by the multiplication overflow test.
> > The same holds for a 64-bit long using similarly crafted digit
> > sequences.
> > 
> > This patch replaces the mod 10 test with a simpler limit test, and
> > similarly the mod 26 test in dlang_decode_backref.
> > 
> > About the limit test:
> >   val * 10 + digit > ULONG_MAX is the condition for overflow
> > ie.
> >   val * 10 > ULONG_MAX - digit
> > or
> >   val > (ULONG_MAX - digit) / 10
> > or assuming the largest digit
> >   val > (ULONG_MAX - 9) / 10
> > 
> > I resisted the aesthetic appeal of simplifying this further to
> >   val > -10UL / 10
> > since -1UL for ULONG_MAX is only correct for 2's complement numbers.
> > 
> > Passes all the libiberty tests, on both 32-bit and 64-bit hosts.  OK
> > to apply?
> > 
> 
> Thanks Alan, change seems reasonable, however on giving it a mull over,
> I see that the largest number that dlang_number would need to be able to
> handle is UINT_MAX.  These two tests which decode a wchar value are
> representative of that (first is valid, second invalid).
> 
> #
> --format=dlang
> _D4test21__T3funVwi4294967295Z3funFNaNbNiNfZv
> test.fun!('\U').fun()
> #
> --format=dlang
> _D4test21__T3funVwi4294967296Z3funFNaNbNiNfZv
> _D4test21__T3funVwi4294967296Z3funFNaNbNiNfZv

Hmm, OK, on a 32-bit host those both won't be handled due to the
"(long) val < 0" test.

Do you really want the last one to fail on a 64-bit host, ie. not
produce "test.fun!('\U1').fun()"?  I'm guessing you do, but
I'd like that confirmed before posting a followup patch.

> I'm fine with creating a new PR and dealing with the above in a separate
> change though, as it will require a few more replacements to adjust the
> result parameter type to 'unsigned' or 'long long'.  

Ha!  When I wrote the original patch I changed most of the "long"
parameters and variables to "unsigned long", because it was clear from
the code (and my understanding of name mangling) that the values were
in fact always unsigned.  I also changed "int" to "size_t" where
appropriate, not that you need to handle strings larger than 2G in
length but simply that size_t is the correct type to use with string
functions, malloc, memcpy and so on.  I decided to remove all those
changes before posting as they were just tidies, I thought.

-- 
Alan Modra
Australia Development Lab, IBM


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-03 Thread Rodriguez Bahena, Victor via Gcc-patches


-Original Message-
From: Qing Zhao 
Date: Thursday, September 3, 2020 at 12:55 PM
To: Kees Cook 
Cc: Segher Boessenkool , Jakub Jelinek 
, Uros Bizjak , "Rodriguez Bahena, Victor" 
, GCC Patches 
Subject: Re: PING [Patch][Middle-end]Add 
-fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]



> On Sep 3, 2020, at 12:13 PM, Kees Cook  wrote:
> 
> On Thu, Sep 03, 2020 at 09:29:54AM -0500, Qing Zhao wrote:
>> On average, all the options starting with “used_…”  (i.e, only the 
registers that are used in the routine will be zeroed) have very low runtime 
overheads, at most 1.72% for integer benchmarks, and 1.17% for FP benchmarks. 
>> If all the registers will be zeroed, the runtime overhead is bigger, 
all_arg is 5.7%, all_gpr is 3.5%, and all is 17.56% for integer benchmarks on 
average. 
>> Looks like the overhead of zeroing vector registers is much bigger. 
>> 
>> For ROP mitigation, -fzero-call-used-regs=used-gpr-arg should be enough, 
the runtime overhead with this is very small.
> 
> That looks great; thanks for doing those tests!
> 
> (And it seems like these benchmarks are kind of a "worst case" scenario
> with regard to performance, yes? As in it's mostly tight call loops?)

The top 3 benchmarks that have the most overhead from this option are: 
531.deepsjeng_r, 541.leela_r, and 511.povray_r.
All of them are C++ benchmarks. 
I guess that the most important reason is  the smaller routine size in 
general (especially at the hot execution path or loops).
As a result, the overhead of these additional zeroing instructions in each 
routine will be relatively higher.  

Qing

I think that overhead is expected in benchmarks like 541.leela_r, which according to
https://www.spec.org/cpu2017/Docs/benchmarks/541.leela_r.html is a benchmark
for Artificial Intelligence (Monte Carlo simulation, game tree search & pattern
recognition). The addition of -fzero-call-used-regs represents an overhead
each time a function is called, and in areas like game tree search the call
frequency is high.

Qing, thanks a lot for the measurements.  I am not sure if this is the limit of
overhead the community is willing to accept for the extra security (as a gcc
user, I would be willing to accept it).

Regards

Victor 


> 
> -- 
> Kees Cook




Re: [PATCH] Fix ICE in ipa-cp due to cost addition overflow (PR 96806)

2020-09-03 Thread Feng Xue OS via Gcc-patches
>> Hi,
>>
>> On Mon, Aug 31 2020, Feng Xue OS wrote:
>> > This patch is to fix a bug that cost that is used to evaluate clone 
>> > candidate
>> > becomes negative due to integer overflow.
>> >
>> > Feng
>> > ---
>> > 2020-08-31  Feng Xue  
>> >
>> > gcc/
>> > PR tree-optimization/96806
>>
>> the component is "ipa," please change that when you commit the patch.
>>
>> > * ipa-cp.c (decide_about_value): Use safe_add to avoid cost 
>> > addition
>> > overflow.
>>
>> assuming you have bootstrapped and tested it, it is OK for both trunk
>> and all affected release branches.
>
>I have already added caps on things that come from profile counts so
>things do not overflow, but I think in longer run we want to simply use
>sreals here..
>> >&& !good_cloning_opportunity_p (node,
>> > - val->local_time_benefit
>> > - + val->prop_time_benefit,
>> > + safe_add (val->local_time_benefit,
>> > +   val->prop_time_benefit),
>> >   freq_sum, count_sum,
>> > - val->local_size_cost
>> > - + val->prop_size_cost))
>> > + safe_add (val->local_size_cost,
>> > +   val->prop_size_cost)))
>
>Is it also size cost that may overflow? That seem bit odd ;)
>

Yes.  prop_size_cost accumulates all callees' size_cost.  And since
there are two recursive calls, this value grows exponentially (as a
power of 2) and easily exceeds the range of an integer.

It is actually a defect of the cost computation for recursive cloning.
But I think we need a more complete consideration of how to adjust the
cost model for recursive cloning, including profile estimation,
threshold, size_cost...

A quick fix is to add a cap here to avoid the overflow.
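
Something along these lines is what the cap amounts to; this is only a
sketch, the actual safe_add in ipa-cp.c may differ in its details:

#include <limits.h>

/* Saturating addition of two non-negative costs: clamp to INT_MAX instead
   of wrapping into a negative value when the sum would overflow.  */
static int
safe_add (int a, int b)
{
  if (a > INT_MAX / 2 || b > INT_MAX / 2)
    return INT_MAX;
  return a + b;
}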

Feng

>Honza
>> >  return false;
>> >
>> >if (dump_file)
>>
>> [...]
>

Re: [PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-03 Thread luoxhu via Gcc-patches
Hi,

On 2020/9/3 18:29, Richard Biener wrote:
> On Thu, Sep 3, 2020 at 11:20 AM luoxhu  wrote:
>>
>>
>>
>> On 2020/9/2 17:30, Richard Biener wrote:
 so maybe bypass convert_vector_to_array_for_subscript for special 
 circumstance
 like "i = v[n%4]" or "v[n&3]=i" to generate vec_extract or vec_insert 
 builtin
 call a relative simpler method?
>>> I think you have it backward.  You need to work with what
>>> convert_vector_to_array_for_subscript
>>> gives and deal with it during RTL expansion / optimization to generate
>>> more optimal
>>> code for power.  The goal is to have as little target specific
>>> builtins during the GIMPLE
>>> optimization phase (because we cannot work out its semantics in optimizers).
>>
>> OK, got it, will add optabs vec_insert and expand 
>> "VIEW_CONVERT_EXPR(v1)[_1] = i_6(D);"
>> expressions to rs6000_expand_vector_insert instead of builtin call.
>> vec_extract already has optabs and "i = v[n%4]" should be in another patch
>> after this.
> 
> There is already vec_set and vec_extract - the question is whether the 
> expander
> tries those for variable index.
> 

Yes, I checked and found that both vec_set and vec_extract don't support a
variable index for most targets; store_bit_field_1 and extract_bit_field_1
only consider using the optabs when the index is an integer value.  Anyway, it
shouldn't be hard to extend them depending on target requirements.

Another problem is that v[n&3]=i and vec_insert(v, i, n) generate
different gimple code:

{
_1 = n & 3;
VIEW_CONVERT_EXPR(v1)[_1] = i;
}

vs:

{
  __vector signed int v1;
  __vector signed int D.3192;
  long unsigned int _1;
  long unsigned int _2;
  int * _3;

   [local count: 1073741824]:
  D.3192 = v_4(D);
  _1 = n_7(D) & 3;
  _2 = _1 * 4;
  _3 = &D.3192 + _2; 
  *_3 = i_8(D);
  v1_10 = D.3192;
  return v1_10;
}

If we don't use a builtin for "vec_insert(v, i, n)", the pointer is "int *"
instead of a vector type; will it be difficult for the expander to capture so
many statements and then call the optabs?  So shall we still keep the builtin
style for "vec_insert(v, i, n)" and expand "v[n&3]=i" with optabs, or expand
both with optabs?
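
For reference, the two source forms being compared look roughly like this
(a sketch, assuming -maltivec/-mvsx and the altivec.h intrinsics):

#include <altivec.h>

/* Subscript form: gimplified via convert_vector_to_array_for_subscript.  */
vector signed int
set_elem_subscript (vector signed int v, int i, unsigned long n)
{
  v[n & 3] = i;
  return v;
}

/* Builtin form: the subject of the first patch in this thread.  */
vector signed int
set_elem_builtin (vector signed int v, int i, unsigned long n)
{
  return vec_insert (i, v, n);
}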

I drafted a quick patch to expand "v[n&3]=i" with optabs as below; sorry for
not using the existing vec_set yet, as I was not quite sure about it.  Together
with the first patch, both cases could be handled as expected:


[PATCH] Expander: expand VIEW_CONVERT_EXPR to vec_insert with variable index

v[n%4] = i has same semantic with vec_insert (i, v, n), but it will be
optimized to "VIEW_CONVERT_EXPR(v1)[_1] = i;" in gimple, this
patch tries to recognize the pattern in expander and use optabs to
expand it to fast instructions like vec_insert: lvsl+xxperm+xxsel.

gcc/ChangeLog:

* config/rs6000/vector.md:
* expr.c (expand_assignment):
* optabs.def (OPTAB_CD):
---
 gcc/config/rs6000/vector.md | 13 +++
 gcc/expr.c  | 46 +
 gcc/optabs.def  |  1 +
 3 files changed, 60 insertions(+)

diff --git a/gcc/config/rs6000/vector.md b/gcc/config/rs6000/vector.md
index 796345c80d3..46d21271e17 100644
--- a/gcc/config/rs6000/vector.md
+++ b/gcc/config/rs6000/vector.md
@@ -1244,6 +1244,19 @@ (define_expand "vec_extract"
   DONE;
 })
 
+(define_expand "vec_insert"
+  [(match_operand:VEC_E 0 "vlogical_operand")
+   (match_operand: 1 "register_operand")
+   (match_operand 2 "register_operand")]
+  "VECTOR_MEM_ALTIVEC_OR_VSX_P (mode)"
+{
+  rtx target = gen_reg_rtx (V16QImode);
+  rs6000_expand_vector_insert (target, operands[0], operands[1], operands[2]);
+  rtx sub_target = simplify_gen_subreg (GET_MODE(operands[0]), target, 
V16QImode, 0);
+  emit_insn (gen_rtx_SET (operands[0], sub_target));
+  DONE;
+})
+
 ;; Convert double word types to single word types
 (define_expand "vec_pack_trunc_v2df"
   [(match_operand:V4SF 0 "vfloat_operand")
diff --git a/gcc/expr.c b/gcc/expr.c
index dd2200ddea8..ce2890c1a2d 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -5237,6 +5237,52 @@ expand_assignment (tree to, tree from, bool nontemporal)
 
   to_rtx = expand_expr (tem, NULL_RTX, VOIDmode, EXPAND_WRITE);
 
+  tree type = TREE_TYPE (to);
+  if (TREE_CODE (to) == ARRAY_REF && tree_fits_uhwi_p (TYPE_SIZE (type))
+ && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
+ && tree_to_uhwi (TYPE_SIZE (type))
+* tree_to_uhwi (TYPE_SIZE_UNIT (type))
+  == 128)
+   {
+ tree op0 = TREE_OPERAND (to, 0);
+ tree op1 = TREE_OPERAND (to, 1);
+ if (TREE_CODE (op0) == VIEW_CONVERT_EXPR)
+   {
+ tree view_op0 = TREE_OPERAND (op0, 0);
+ mode = TYPE_MODE (TREE_TYPE (view_op0));
+ if (TREE_CODE (TREE_TYPE (view_op0)) == VECTOR_TYPE)
+   {
+ rtx value
+   = expand_expr (from, NULL_RTX, VOIDmode, EXPAND_NORMAL);
+ rtx pos
+   = expand_expr (op1, NULL_RTX, VOIDmode, EXPAND_NORMAL);
+

[wwwdocs PATCH] projects/tree-ssa: add note for deprecated flag -ftree-vectorizer-verbose in vectorization.html

2020-09-03 Thread Hu Jiangping
Although vectorization.html is not up-to-date, it is still
easy to find via search, and the deprecated flag in it may
confuse users.  This patch simply adds a note at the top
of the page, hoping to help users who read it.

OK for master?

Regards!
Hujp

---
 htdocs/projects/tree-ssa/vectorization.html | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/htdocs/projects/tree-ssa/vectorization.html 
b/htdocs/projects/tree-ssa/vectorization.html
index 22d136c0..ab07eb28 100644
--- a/htdocs/projects/tree-ssa/vectorization.html
+++ b/htdocs/projects/tree-ssa/vectorization.html
@@ -13,7 +13,13 @@
 The goal of this project was to develop a loop and basic block
 vectorizer in GCC, based on the tree-ssa framework.
 It has been completed and the functionality has been part of GCC
-for years.
+for years.
+
+(Note: The -ftree-vectorizer-verbose flag below is now
+deprecated.  New flag -fopt-info see https://gcc.gnu.org/onlinedocs/gcc/Developer-Options.html#index-fopt-info";>
+   
https://gcc.gnu.org/onlinedocs/gcc/Developer-Options.html#index-fopt-info)
+
 
 Table of Contents
 
-- 
2.17.1





Re: [PATCH] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-03 Thread luoxhu via Gcc-patches




On 2020/9/4 14:16, luoxhu via Gcc-patches wrote:

Hi,


Yes, I checked and found that both vec_set and vec_extract doesn't support
variable index for most targets, store_bit_field_1 and extract_bit_field_1
would only consider use optabs when index is integer value.  Anyway, it
shouldn't be hard to extend depending on target requirements.

Another problem is v[n&3]=i and vec_insert(v, i, n) are generating with
different gimple code:

{
_1 = n & 3;
VIEW_CONVERT_EXPR(v1)[_1] = i;
}

vs:

{
   __vector signed int v1;
   __vector signed int D.3192;
   long unsigned int _1;
   long unsigned int _2;
   int * _3;

[local count: 1073741824]:
   D.3192 = v_4(D);
   _1 = n_7(D) & 3;
   _2 = _1 * 4;
   _3 = &D.3192 + _2;
   *_3 = i_8(D);
   v1_10 = D.3192;
   return v1_10;
}


I just realized that using convert_vector_to_array_for_subscript would generate
"VIEW_CONVERT_EXPR(v1)[_1] = i;" before producing those instructions;
your confirmation and comments would be highly appreciated...  Thanks in
advance. :)


Xionghu



If not use builtin for "vec_insert(v, i, n)", the pointer is "int*" instead
of vector type, will this be difficult for expander to capture so many
statements then call the optabs?  So shall we still keep the builtin style
for "vec_insert(v, i, n)" and expand "v[n&3]=i" with optabs or expand both
with optabs???

Drafted a fast patch to expand "v[n&3]=i" with optabs as below, sorry that not
using existed vec_set yet as not quite sure, together with the first patch, both
cases could be handled as expected:


[PATCH] Expander: expand VIEW_CONVERT_EXPR to vec_insert with variable index

v[n%4] = i has same semantic with vec_insert (i, v, n), but it will be
optimized to "VIEW_CONVERT_EXPR(v1)[_1] = i;" in gimple, this
patch tries to recognize the pattern in expander and use optabs to
expand it to fast instructions like vec_insert: lvsl+xxperm+xxsel.

gcc/ChangeLog:

* config/rs6000/vector.md:
* expr.c (expand_assignment):
* optabs.def (OPTAB_CD):
---
  gcc/config/rs6000/vector.md | 13 +++
  gcc/expr.c  | 46 +
  gcc/optabs.def  |  1 +
  3 files changed, 60 insertions(+)

diff --git a/gcc/config/rs6000/vector.md b/gcc/config/rs6000/vector.md
index 796345c80d3..46d21271e17 100644
--- a/gcc/config/rs6000/vector.md
+++ b/gcc/config/rs6000/vector.md
@@ -1244,6 +1244,19 @@ (define_expand "vec_extract"
DONE;
  })

+(define_expand "vec_insert"
+  [(match_operand:VEC_E 0 "vlogical_operand")
+   (match_operand: 1 "register_operand")
+   (match_operand 2 "register_operand")]
+  "VECTOR_MEM_ALTIVEC_OR_VSX_P (mode)"
+{
+  rtx target = gen_reg_rtx (V16QImode);
+  rs6000_expand_vector_insert (target, operands[0], operands[1], operands[2]);
+  rtx sub_target = simplify_gen_subreg (GET_MODE(operands[0]), target, 
V16QImode, 0);
+  emit_insn (gen_rtx_SET (operands[0], sub_target));
+  DONE;
+})
+
  ;; Convert double word types to single word types
  (define_expand "vec_pack_trunc_v2df"
[(match_operand:V4SF 0 "vfloat_operand")
diff --git a/gcc/expr.c b/gcc/expr.c
index dd2200ddea8..ce2890c1a2d 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -5237,6 +5237,52 @@ expand_assignment (tree to, tree from, bool nontemporal)

to_rtx = expand_expr (tem, NULL_RTX, VOIDmode, EXPAND_WRITE);

+  tree type = TREE_TYPE (to);
+  if (TREE_CODE (to) == ARRAY_REF && tree_fits_uhwi_p (TYPE_SIZE (type))
+ && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
+ && tree_to_uhwi (TYPE_SIZE (type))
+* tree_to_uhwi (TYPE_SIZE_UNIT (type))
+  == 128)
+   {
+ tree op0 = TREE_OPERAND (to, 0);
+ tree op1 = TREE_OPERAND (to, 1);
+ if (TREE_CODE (op0) == VIEW_CONVERT_EXPR)
+   {
+ tree view_op0 = TREE_OPERAND (op0, 0);
+ mode = TYPE_MODE (TREE_TYPE (view_op0));
+ if (TREE_CODE (TREE_TYPE (view_op0)) == VECTOR_TYPE)
+   {
+ rtx value
+   = expand_expr (from, NULL_RTX, VOIDmode, EXPAND_NORMAL);
+ rtx pos
+   = expand_expr (op1, NULL_RTX, VOIDmode, EXPAND_NORMAL);
+ rtx temp_target = gen_reg_rtx (mode);
+ emit_move_insn (temp_target, to_rtx);
+
+ machine_mode outermode = mode;
+ scalar_mode innermode = GET_MODE_INNER (outermode);
+ class expand_operand ops[3];
+ enum insn_code icode
+   = convert_optab_handler (vec_insert_optab, innermode,
+outermode);
+
+ if (icode != CODE_FOR_nothing)
+   {
+ pos = convert_to_mode (E_SImode, pos, 0);
+
+ create_fixed_operand (&ops[0], temp_target);
+ create_input_operand (&ops[1], value, innermode);
+ create_input_operand (&ops[2], pos, GET_MODE (pos));
+   

Re: [PATCH, rs6000] Fix Vector long long subtype (PR96139)

2020-09-03 Thread Richard Biener via Gcc-patches
On Thu, Sep 3, 2020 at 8:10 PM Segher Boessenkool
 wrote:
>
> Hi!
>
> On Thu, Sep 03, 2020 at 10:37:33AM -0500, will schmidt wrote:
> > On Wed, 2020-09-02 at 05:13 -0500, Segher Boessenkool wrote:
> > > On Tue, Sep 01, 2020 at 09:00:20PM -0500, will schmidt wrote:
> > > >   This corrects an issue with the powerpc vector long long
> > > > subtypes.
> > > > As reported by SjMunroe in PR96139.  When building some code with
> > > > -Wall
> > > > and attempting to print an element of a "long long vector" with a
> > > > long long
> > > > printf format string, we will report a error because the vector
> > > > sub-type
> > > > was improperly defined as int.
> > > >
> > > > When defining a V2DI_type_node we use a TARGET_POWERPC64 ternary to
> > > > define the V2DI_type_node with "vector long" or "vector long long".
> > > > We also need to specify the proper sub-type when we define the
> > > > type.
> > > > -  V2DI_type_node = rs6000_vector_type (TARGET_POWERPC64 ?
> > > > "__vector long"
> > > > -: "__vector long long",
> > > > -intDI_type_node, 2);
> > > > +  V2DI_type_node
> > > > += rs6000_vector_type (TARGET_POWERPC64
> > > > +   ? "__vector long" : "__vector long long",
> > > > +   TARGET_POWERPC64
> > > > +   ? long_long_integer_type_node :
> > > > intDI_type_node,
> > > > +   2);
> > >
> > > Can't you just use long_long_integer_type_node in all cases?  Or,
> > > what
> > > else is intDI_type_node for 32 bit?
> >
> > I'm not sure offhand.  I'm tending to assume the use of intDI_type_node
> > is critical for some underlying reason.I'll give this a spin with
> > just long_long_integer_type_node and see what happens.
>
> If that works, that is okay for trunk (and all backports you want).  If
> not, just use what you sent.

Beware of alias issues!  'long' and 'long long' are distinct types even if
they have the same size, and long * aliases with a vector of long but not
with a vector of long long.

So if the user writes 'vector long' you should use the appropriate component
type.  If the user writes 'vector intDI_type' you should use intDI_type.
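
A minimal sketch of the aliasing point (an assumed example, not taken from
the PR):

#include <altivec.h>

/* A 'long long *' may access the elements of a vector whose component type
   really is 'long long'; if the component type were a distinct same-sized
   integer type, this access would not be valid under the aliasing rules,
   even though the layout is identical.  */
long long
first_elem (vector long long *vp)
{
  long long *p = (long long *) vp;
  return p[0];
}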

Richard.

> Thanks!
>
>
> Segher