Re: On-Demand range technology [2/5] - Major Components : How it works

2019-05-23 Thread Richard Biener
On Thu, May 23, 2019 at 3:28 AM Andrew MacLeod  wrote:
>
> This note will talk about the 4 major components of the prototype and
> explain how they work together.  I will be fairly light on detail just
> to give an overview; we can delve into whatever details are needed.
> - Range-ops : Range operations at the statement level
> - GORI - Generates Outgoing Range Info : Ranges generated by basic blocks
> - GORI Cache - combining and caching basic-block range info
> - Ranger - API and query control
>
> 1 * Range-ops
> 
> The first major component is the centralized range operation database.
> This is where range operations for tree-codes are implemented.  The
> compiler currently has code which can fold ranges, but the new mechanism
> is a general purpose solver which can solve for other operands. If there
> are N input/output operands and we have ranges for N-1, it is often
> possible to derive the missing range, ie
>  lhs = op1 + op2
> The existing code in the compiler can solve for LHS given ranges for OP1
> and OP2. This has been extended so that we can also sometimes solve for
> either op1 or op2, ie
>  [20,40] = op1 + [5, 10]
> ...can calculate that op1 has the range [10, 35]
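The solve-for-operand step can be sketched in a couple of lines of interval arithmetic (the helper name is hypothetical, not the branch's actual range-ops API). Solving lhs = op1 + op2 for op1 is interval subtraction, which pairs opposite endpoints: [20-10, 40-5] = [10, 35].

```python
# Hypothetical sketch of solving "lhs = op1 + op2" for op1 when the
# ranges of lhs and op2 are known.  Ranges are (lo, hi) tuples.
def solve_op1_plus(lhs, op2):
    # op1 = lhs - op2: the widest op1 pairs opposite endpoints
    return (lhs[0] - op2[1], lhs[1] - op2[0])

print(solve_op1_plus((20, 40), (5, 10)))  # (10, 35)
```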
>
> This ability to solve for the other operands provides the ability to
> calculate ranges in the reverse order we are accustomed to, and is key
> to enabling the on-demand range approach.
>  a_2 = b_1 - 20
>  if (a_2 < 40)
>   A conditional jump has an implicit boolean LHS depending on which edge
> is taken. To evaluate ranges on the TRUE edge of the branch, the LHS is
> [1,1]. To calculate the range for a_2 we simply solve the equation:
>  [1,1] = a_2 < 40
> which provides the answer as [0,39].  Furthermore, substituting this
> range for a_2 as the LHS of its definition statement:
>  [0,39] = b_1 - 20
> The same evaluation mechanism can calculate a result for b_1 on the
> true edge as [20,59]. This is the key feature which allows the rest of
> the algorithm to work in a general way.
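The two-step backward walk above can be mimicked with plain interval arithmetic (a toy sketch, assuming the unsigned type the [0,39] result implies; no connection to the branch's actual classes):

```python
# Sketch: evaluate ranges backwards through
#   a_2 = b_1 - 20 ; if (a_2 < 40)
# on the TRUE edge, assuming an unsigned type so the minimum is 0.

# Step 1: solve [1,1] = (a_2 < 40) for a_2.
a_2 = (0, 39)                      # TRUE edge of "a_2 < 40"

# Step 2: solve a_2 = b_1 - 20 for b_1, i.e. b_1 = a_2 + 20.
b_1 = (a_2[0] + 20, a_2[1] + 20)
print(b_1)                         # (20, 59)
```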
>
> All operations are driven from this component, and the only thing
> required to add support for another tree code is to implement one or
> more methods for it here, and the rest of the range mechanism will
> simply work with it.
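The centralized-database idea can be pictured as a per-tree-code table of handlers, each implementing fold plus the reverse solvers (class and table names below are illustrative, not the real range-ops identifiers):

```python
# Sketch of the dispatch idea: one handler per operation, registered in
# a table keyed by tree code.  Adding a new tree code only means adding
# a handler; everything else queries the table.  Names are illustrative.
class PlusOp:
    def fold(self, op1, op2):        # lhs from op1, op2
        return (op1[0] + op2[0], op1[1] + op2[1])
    def solve_op1(self, lhs, op2):   # op1 = lhs - op2
        return (lhs[0] - op2[1], lhs[1] - op2[0])
    def solve_op2(self, lhs, op1):   # op2 = lhs - op1
        return (lhs[0] - op1[1], lhs[1] - op1[0])

range_ops = {"PLUS_EXPR": PlusOp()}

op = range_ops["PLUS_EXPR"]
print(op.fold((1, 2), (10, 20)))            # (11, 22)
print(op.solve_op1((20, 40), (5, 10)))      # (10, 35)
```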
>
>
> 2 * GORI
> 
> The second component is the “Generates Outgoing Range Info” engine.
> This is a basic-block oriented component which determines what ssa-names
> have ranges created on outgoing edges, and how to calculate those ranges.
>
> The last statement in the block is examined to see if it is a branch
> which generates range info, and then determines which ssa-names in the
> block can have ranges calculated.  It quickly walks the use-def chain
> from the branch to find other definitions within the block that can
> impact the branch and could also have their ranges calculated. This
> summary is then cached for future use.
>
> The GORI component also uses this summary info to perform the
> calculations necessary to determine the outgoing range for any ssa_name
> which can be determined.  For example:
>  c_3 = foo ()
>  a_2 = b_1 - 20
>  If (a_2 < 40)
> The summary information indicates that b_1 and a_2 can have their
> outgoing ranges calculated for this block, and the cached information
> is used to quickly calculate them when required.
> The API consists of 2 basic methods, query and calculate:
>   - bool has_edge_range_p (edge, name);
>   - range outgoing_edge_range (edge, name);
> If a query is made for any other ssa-name, it simply returns false and
> says this block does not generate a range for that name.  This is a key
> rationale for the summary so we only spend time processing names for
> blocks that actually have ranges generated.
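A toy model of that two-method API (everything here is a hypothetical sketch; the real GORI engine computes the exports by walking use-def chains from the branch):

```python
# Toy model of the GORI summary: per block, which ssa-names get
# outgoing ranges and how to compute them per edge.  Illustrative only.
class GoriBlock:
    def __init__(self):
        self.exports = {}            # name -> per-edge range calculator

    def has_edge_range_p(self, edge, name):
        return name in self.exports

    def outgoing_edge_range(self, edge, name):
        if not self.has_edge_range_p(edge, name):
            return None              # block generates no range for name
        return self.exports[name](edge)

bb = GoriBlock()
# "if (a_2 < 40)" (unsigned): TRUE edge gives [0,39], FALSE [40,UMAX]
bb.exports["a_2"] = lambda e: (0, 39) if e == "T" else (40, 2**32 - 1)

print(bb.has_edge_range_p("T", "c_3"))     # False: c_3 not exported
print(bb.outgoing_edge_range("T", "a_2"))  # (0, 39)
```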

So the current code (in a few cases) derives ranges for SSA uses when
they see for example

 _3 = _4 / _5;

here _5 != 0 for the stmts following this one (OK, this is a bad example
because for divisions we don't do this, but we do it for derefs and ~[0,0]).

The above suggests that iff this is done at all it is not in GORI because
those are not conditional stmts or ranges from feeding those.  The
machinery doing the use-def walking from stmt context also cannot
come along these so I have the suspicion that Ranger cannot handle
telling us that for the stmt following above, for example

 if (_5 != 0)

that _5 is not zero?

Can you clarify?

>
> 3 * GORI cache
> -
> The third component builds on the basic block GORI component, and adds
> the ability to traverse the CFG and combine the various ranges provided
> by each basic block into a cache.
>
> It provides both a global-range cache and a range-on-entry cache
> summarizing the range of an ssa_name on entry to the block.
>
> The range-on-entry cache is self filling, and iteratively evaluates back
> edges. There is a single API entry point which asks for the range on
> entry, and if the data has not been calculated/cached already, i
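The self-filling behaviour described above can be sketched as a lazily populated map from (block, name) to the union of incoming edge ranges (a toy CFG encoding; the iterative back-edge evaluation the text mentions is omitted here):

```python
# Sketch of a self-filling range-on-entry cache: the entry range of a
# name is the union of ranges on each predecessor edge, computed and
# cached only on first request.  Back edges are not handled in this toy.
def union(a, b):
    return (min(a[0], b[0]), max(a[1], b[1]))

class EntryCache:
    def __init__(self, preds, edge_range):
        self.preds = preds            # block -> list of incoming edges
        self.edge_range = edge_range  # (edge, name) -> range on edge
        self.cache = {}

    def range_on_entry(self, block, name):
        key = (block, name)
        if key not in self.cache:     # fill lazily on first request
            r = None
            for e in self.preds[block]:
                er = self.edge_range(e, name)
                r = er if r is None else union(r, er)
            self.cache[key] = r
        return self.cache[key]

cache = EntryCache({"bb4": ["e2->4", "e3->4"]},
                   lambda e, n: (0, 39) if e == "e2->4" else (100, 100))
print(cache.range_on_entry("bb4", "a_2"))  # (0, 100)
```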

Re: On-Demand range technology [3/5] - The Prototype

2019-05-23 Thread Richard Biener
On Thu, May 23, 2019 at 3:29 AM Andrew MacLeod  wrote:
>
> There is a functioning prototype in branch “ssa-range” which is a proof
> of concept that the approach is functional as well as quick, and can be
> used to answer questions which come up regarding what it can and can’t
> do.  Our last merge was on April 13th, so it's fairly up to date.
>
> We have implemented a flexible range class (irange) which allows for
> multiple subranges to represent a range, and which can be extended in
> the future to types other than integral. We use this throughout, but it
> could be replaced in the ranger with any similar API. Conversion
> routines are also provided to convert from irange to value_range and
> value_range to irange.
>
> A full set of tree_code range-op routines are implemented.  We have
> commoned as much code as possible with the existing VRP range extraction
> code.  Also, we have added additional code for calculating the other
> operands from a known result in numerous cases.
>
> The code base in VRP has been modified (via a flag) to
>  - Work completely with the native value_range like it does today.
>  - Use irange and the range-ops component under the covers to
> extract ranges. Requests in VRP are then converted from value_ranges to
> irange, called into the range-op routines, and then converted back to
> value_range for VRP/EVRP’s use.
>  - Do operations both ways and compare the results to make sure both
> agree on the range, and trap if they do not.
>
> The branch defaults to the compare and trap mode to ensure everything is
> working correctly. This has been our correctness model for statement
> range extraction and was active during the Fedora package builds. The
> only time we disabled it was to do performance runs vs RVRP, where we
> were looking at both branch and trunk times for EVRP and VRP.
>
> Of note, we drop all symbolics in ranges to VARYING on everything except
> PLUS and MINUS, which we leave as native calculations if there are
> symbolics present.  More on symbolics later.
>
> A VRP like pass called RVRP has been implemented.
>  - The vr-values statement simplification code has been factored out
> to be range agnostic, meaning that these routines can operate on either
> value_range or irange. Thus, we are using a common code base to perform
> statement simplification as well.
>  - For complete compatibility with EVRP, the RVRP pass builds
> dominators and instantiates the SCEV loop information so we have loop
> range info available. RVRP does not need this info to run, but would
> miss some of the test cases which depend on loop ranges.
>  - RVRP is set up to demonstrate it can process the IL in multiple
> directions and bootstraps/passes all tests in all directions.
>  * Dominator order
>  * Post-dominator order
>  * BB1 thru BBn
>  * BBn thru BB1
>  * branch-only mode where only branches at the end of each BB
> are examined for folding opportunities
>
> 4 additional passes have been converted to use the ranger model:
>  - sprintf - removed the dominator building/walking
>  - warn alloca - replaced calls to get global ranges with calls that
> now return context sensitive ranges.
>  - warn restrict - just replaced EVRP range calls with ranger calls.
>  - backwards threader - enhanced to use contextual range information
> to make additional threading decisions.
>
>
> Symbolic Ranges
>
> One big concern last year expressed was my decision to abolish symbolic
> ranges.
>
> I continue to maintain that there is no need to track the range of x_2
> as [y_3 + 5, MAX] for x_2 = y_3 + 5.   All you need to do is look at the
> definition of x_2, and the same information is exposed right there in
> the IL.  If one requires the symbolic information, the same on-demand
> process could lookup that information as well. This in turn, makes the
> code for ranges much simpler, easier to maintain, and less likely to
> introduce bugs.
>
> We have found through our prototype that symbolics in ranges are not
> nearly as prevalent as one would think.   During the work to share a
> common code base with VRP, we found that symbolic ranges are irrelevant
> for everything except PLUS_EXPR and MINUS_EXPR. The shared code in our
> branch drops symbolics to varying immediately for everything else, and
> it has no impact on EVRP, or VRP or any tests we can find anywhere.
> Furthermore, we never trapped while comparing ranges generated by VRP
> versus generating them with range-ops which drops symbolics to varying.
>
> We tried modifying VRP such that we don’t even create symbolic
> endpoints, but rather revert to VARYING always.  We can find no test
> case that fails because a range is not calculated properly due to
> resolving these endpoints.
>
> There are a few that fail due to the symbolic being used to help track
> relationals, ie
>
>   x_2 = y_3 + 5
>  If (x_2 > y_3) // can be folded since we know x_2 must be > y_3
>
> 

Re: On-Demand range technology [4/5] - Performance results

2019-05-23 Thread Richard Biener
On Thu, May 23, 2019 at 3:29 AM Andrew MacLeod  wrote:
>
> We have done extensive performance analysis to help address concerns
> about the nature of an on-demand model. LLVM made an attempt at
> something similar,  but suffered from significant performance issues
> they could not solve with their approach. This approach is not the same,
> and we have seen no sign of pathological cases which are problematic.
>
> I have trolled bugzilla looking for large problematic test cases, and
> have tried a couple of test cases from LLVM which dberlin pointed out
> they found to be problematic.  To date, I have found nothing that shows
> any kind of performance problem.  I am more than happy to entertain
> whatever cases y’all might want to throw my way.
>
> For a test of robustness, we have built a complete Fedora distribution
> consisting of 9174 packages with the ranger branch. All packages except
> 3 build successfully and pass the relevant regression tests. It appears
> 2 of them are related to RVRP and are still under analysis.
>
> Our primary performance testbase to date has been the compiler itself.
> We compile 242 .ii files from a stage1 compiler build, and compare times
> to EVRP. VRP is quite a bit slower, and although we do have an iterative
> update infrastructure, comparisons aren’t quite fair since we don’t do
> all equivalence and bitmask operations it does yet.
>
> Before going into the numbers, I would like to visit a minor issue we
> have with switches. RVRP works from the bottom up, so in order to
> evaluate a range, it begins by getting the constant range for the LHS of
> the branch from the edge. For a conditional it is trivially [0,0] or
> [1,1] depending on TRUE or FALSE edge.
>
> For a switch, it turns out GCC gimple representation has no simple way
> to figure this out. As a result, we need to loop over every edge in the
> switch and then union together all the cases which share that edge, or
> in the default case, intersect out all the other cases. This turns out
> to be *very* time consuming in test cases with very large switches,
> somewhere in the vicinity of O(n^3). Ugg.  So the ranger incurs a fair
> amount of overhead trying to evaluate, and re-evaluate these constant
> edges.
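The per-edge computation described above can be sketched as follows (a toy over a small enumerable domain; the real irange works with lists of subranges, not enumerated values, which is where the cost comes from):

```python
# Sketch of the switch-edge range computation: union the case ranges
# that share an edge; for the default edge, remove every case from the
# domain.  Toy version over a small domain; illustrative names only.
def edge_range(cases, edge, domain):
    # cases: list of ((lo, hi), target_edge)
    lo, hi = domain
    if edge == "default":
        taken = {v for (a, b), _ in cases for v in range(a, b + 1)}
        left = [v for v in range(lo, hi + 1) if v not in taken]
        return (min(left), max(left))          # hull of what remains
    r = None
    for (a, b), target in cases:
        if target == edge:
            r = (a, b) if r is None else (min(r[0], a), max(r[1], b))
    return r

cases = [((0, 0), "e1"), ((1, 3), "e1"), ((7, 7), "e2")]
print(edge_range(cases, "e1", (0, 255)))       # (0, 3)
print(edge_range(cases, "default", (0, 255)))  # (4, 255)
```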
>
> There are ways to address this… but for now we will present performance
> numbers with different switch configurations, each of the 5
> configurations listed here:
>  1 - Calculating ranges outright  from the stock branch.
>  2 - Timers adjusted to exclude switch edge calculation code (i.e.
> pretend the range is available on the edge like it is with TRUE and FALSE).
>  3 - Do not process switches.  We spend extra time on switches
> because we always attempt to calculate ranges very precisely as if we
> had infinite precision. There is room for trimming outcomes here, but we
> have made no attempt yet.
>  4 - Just like EVRP,  RVRP currently includes building the
> dominators and integrating calls into SCEV at the beginning of each
> block to see if there are any loop range refinements.  The ranger has no
> need for this to operate, and many passes will not care.  So we produce
> a 4th number for RVRP where we don’t build any of the infrastructure it
> doesn’t need.
>  5 - RVRP can also run in conditional-only mode. Rather than walking
> the entire IL trying to resolve every range, it can simply look at the
> last statement of every block asking if the branch can be folded.  This
> gets a lot of what vrp gets that affects the CFG and could be utilized
> in either lower optimization levels, or as VRP if we can push all the
> other activities it does into appropriate optimizations (such as making
> CCP range aware).  **NOTE**: This *still* calculates switch ranges,
> so includes that slow down.
>
> All times are with a release configured compiler.  Out of the 242 files,
> pretty much across the board in all 4 sets of figures, RVRP was faster
> in about 90% of the cases, and slower in the other 10%, resulting in the
> following cumulative totals.
>
>  Overall (242 files)
> 1 - Raw RVRP                                  22% slower  (**a**)
> 2 - No edge calculation (1)                  4.5% slower
> 3 - No switch processing (2)                 9.5% faster
> 4 - No dominators (3)                         16% faster  (**b**)
> 5 - Conditionals (including switches) only     4% faster

I don't understand a vs. b here.  Does this mean dominators
cost 38% performance?!

Does 4 imply 3 and 2 maybe?

Dominators cannot be this expensive.

> These numbers indicate that large switches are the primary cause when
> RVRP is slower.  We have various approaches we could use to address
> this.   Removing the time spent building unnecessary dominators shows a
> significant improvement as well.
>
> We also have the results for the time spent in the passes we converted:
>
>  Overall (242 files)
> 1 - -wrestrict   19% faster
> 2 - -wprintf     95% faster
> 3 - -wa

Re: On-Demand range technology [5/5] - Looking to the future.

2019-05-23 Thread Richard Biener
On Thu, May 23, 2019 at 3:30 AM Andrew MacLeod  wrote:
>
> A primary goal of this approach is to try to pull the various aspects of
> VRP apart and make them individually viable so they can be used at
> appropriate places as needed.  The various components of VRP were
> identified as:
>  - Ranges
>  - Relational queries
>  - Equivalencies
>  - Bitmask tracking
>  - Symbolic range endpoint resolution
>
> This prototype implementation tackles only the range component. It makes
> ranges easily accessible from anywhere in the compiler.
>
> I have plans for addressing these components within the same model, but
> maintaining their independence.  This should make maintenance easier,
> less error prone, and ultimately be far more flexible as other passes
> can utilize whichever aspects they need.
>
>
> Symbolic range endpoint resolution
> --
>
> I touched on this in the prototype section, and maintain that the only
> requirement for symbols falls under the equivalence and relational tracking.
>  x_2 = y_3 + 5
>  If (x_2 > y_3)
> Ranges themselves in VRP are eventually resolved to a constant for
> export to the global range table.  At this point, whatever range is
> known for the symbolic is substituted into the value_range, and any
> expression resolved to come up with the final non-symbolic range.
>  X_2 = [y_3 + 5, MAX]
> If y_3 evaluates to [20, 30], then x_2 is resolved as [25, MAX].
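That substitution step is simple enough to show directly (INT_MAX stands in for the example's MAX; a sketch, not the branch's code):

```python
# Sketch of resolving a symbolic endpoint: substitute the known range
# of y_3 into [y_3 + 5, MAX] and fold the lower bound.
INT_MAX = 2**31 - 1

y_3 = (20, 30)
x_2 = (y_3[0] + 5, INT_MAX)   # lower bound is min(y_3) + 5 = 25
print(x_2)                    # (25, 2147483647)
```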
>
> The ranger does this by default on the fly due to its nature. When the
> range of x_2 is requested the first time, it evaluates y_3 , comes up
> with the same [20, 30] range for y_3, and evaluates it to [25,max]
> immediately.
>
> The facility is there to reevaluate the range if the range of y_3
> changes, but to this point it has not been needed. Typically it involves
> y_3 derived in some way from a back edge, and also being derived by yet
> another ssa-name from a different back edge. So, not super common.
> However, I do plan to get to this eventually to enable those
> re-calculations. For the prototype, it has not been deemed critical since
> EVRP doesn't even do back edges.
>
> Equivalencies and other Relationals
> --
>
> The relationship between ssa names are the primary use of symbols in
> ranges today, but the actual property of relations and equivalencies has
> little to do with ranges.
>
> I propose that we utilize the exact same model as the range operation
> database to track relationships. Both equivalencies and relationals can
> be combined, as “==” and “!=” are merely other relations.   Each tree
> code has a query to ask for the relationship between any of its
> operands. Ie:
>  y_2 = x_3
>  j_4 = y_2 + 6
>  If (j_4 > x_3)
> Knowing the ranges of j_4 and x_3 don’t really help resolve the
> condition.  If x_3 is varying, or even a non-constant, we know nothing
> at all, at least from a range perspective.
>
> Applying the same calculation model the ranger uses from a relational
> point of view, range-ops can be given a relational interface in which
> each tree code can evaluate the relation between its operands.   A copy
> would return “==” for the relation between the LHS and op1, so we’d have
> the relation y_2 == x_3
>
> Operator plus would look at its operands, and be able to indicate J_4 >
> Y_2 because operand 2 is a positive constant.
>
> The branch is the one we care about, and a query would be made for the
> relation between j_4 and x_3.  By combining the relations that feed it,
> we’d get j_4 > (y_2 == x_3), and the relational result would be
> j_4 > x_3.  When applied to (j_4 > x_3) the result is true.
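The composition step can be sketched as a tiny relation-combining table (the encoding is purely illustrative; it only shows how "j_4 > y_2" composed with "y_2 == x_3" yields "j_4 > x_3", letting the branch fold without any range):

```python
# Toy relational combination: compose "lhs rel1 mid" with "mid rel2 rhs"
# to get the relation between lhs and rhs.  Illustrative encoding.
def combine(rel1, rel2):
    table = {(">", "=="): ">", ("==", ">"): ">",
             ("==", "=="): "==", (">", ">"): ">"}
    return table.get((rel1, rel2))   # None when nothing can be deduced

# y_2 = x_3     -> y_2 == x_3   (copy)
# j_4 = y_2 + 6 -> j_4 > y_2    (plus positive constant)
rel = combine(">", "==")
print(rel)  # ">" : so "if (j_4 > x_3)" folds to true
```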
>
> So the relational query would be able to answer the question without
> ever looking at a range, although if a range is available, it may help
> refine the answer.  The lookup process is identical to the way ranges
> are currently handled, which means the same query infrastructure can be
> leveraged and used independently or in concert with ranges.
>
> This would also benefit from not carrying information around unless it
> is requested/required. Currently all equivalences must be stored in case
> we need to know if there is an equivalency. Just like with ranges, this
> model would have no need to even look at an equivalency unless there is
> an actual need to know.
>
> Clearly there is work to do, but this has a lot of potential as a follow
> up to the range work since it uses the same infrastructure. Any pass
> could cheaply ask about the equivalence/relation between any 2 ssa_names.
>
>
> Bitmask tracking
> 
> Not to sound like a broken record, but the exact same process can also
> be applied to bitmasks.
>  X_2 = y_1 | 0x01
>  If (x_2 == 2)// can never be true
>
> Bitwise operators, as well as other operators like *, /, shifts, etc can
> calculate bitmasks  in exactly the same way ranges are calculated. I
> also consider
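The OR example above can be checked with a minimal known-bits sketch (the (known_one, known_zero) encoding and helper names are illustrative, not GCC's representation):

```python
# Sketch of bit tracking: after "x_2 = y_1 | 0x01" the low bit of x_2
# is known to be 1, so the compare x_2 == 2 can be folded to false.
def known_bits_or(one1, zero1, one2, zero2):
    # OR: a result bit is known 1 if either side is known 1,
    #     known 0 only if both sides are known 0.
    return one1 | one2, zero1 & zero2

def can_equal(value, known_one, known_zero):
    # impossible if value clears a known-1 bit or sets a known-0 bit
    return not (known_one & ~value) and not (known_zero & value)

one, zero = known_bits_or(0, 0, 0x01, ~0x01)  # y_1 unknown, const 0x01
print(can_equal(2, one, zero))                # False: x_2 is never 2
```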

Re: -Wformat-diag: floating-point or floating point?

2019-05-23 Thread Joseph Myers
On Wed, 22 May 2019, Martin Sebor wrote:

>   ISO C does not support decimal floating point

And as a tangential point, that's not accurate for C2X, where DFP is an 
optional standard feature, but it's probably reasonable to keep the 
diagnostic saying "ISO C does not support" under its existing conditions 
until someone actually reviews and updates the DFP support for the 
differences between the old TR 24732:2009 and the newer TS 18661-2 / C2X 
version (not sure how big those differences are, but not much has changed 
in the DFP support in GCC lately).

-- 
Joseph S. Myers
jos...@codesourcery.com


What are the optimizations that contribute to ~70% improvement on SPEC06 hmmer benchmark?

2019-05-23 Thread a b
Recently I happened to notice that there is more than 70% performance
improvement for the SPEC06 hmmer benchmark from Linaro GCC 5.2-2015.11-2
to GCC 10.0 on ARM platforms.

I did some quick searching and think loop distribution and vectorization
contribute 25% and 30%, respectively. But they still don't add up to 70%. Can
you explain what else is helping here?

Thanks


Re: On-Demand range technology [3/5] - The Prototype

2019-05-23 Thread Eric Botcazou
> While I agree that symbolic ranges are a complication and that most
> cases it currently handles are not "value-range" things I do not agree
> with the idea that we can simply remove handling them and deal
> with the fallout in some later point in the future.  Eric may also be
> able to show you "real" symbolic value-range magic.

There are a couple of testcases in the testsuite that, I believe, require a 
minimal form of support for symbolic ranges: gcc.dg/tree-ssa/vrp94.c and 
gnat.dg/opt40.adb.  They deal with the following pattern in C:

  if (x >= y)
return 1;

  z = y - x;
  if (z <= 0)
abort ();

  return z;

where we want to eliminate the abort.  Of course the C version doesn't really 
make sense on its own, but it's the translation of the Ada version where the

  if (z <= 0)
abort ();

is generated by the compiler (it's a range check in Ada parlance).
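The reasoning the test cases need is small: on the path past the first test we know x < y, hence z = y - x >= 1 and the "z <= 0" check is dead. A quick exhaustive check over a small signed domain (ignoring overflow) confirms it:

```python
# Check the pattern from vrp94.c / opt40.adb: whenever the early
# "x >= y" return is not taken, z = y - x is at least 1, so the
# "z <= 0" abort is unreachable.  Small domain, overflow ignored.
ok = all(y - x >= 1
         for x in range(-50, 50)
         for y in range(-50, 50)
         if not (x >= y))
print(ok)  # True
```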

I also agree that symbolic ranges are a complication but I don't understand 
why they would conceptually be treated differently from literal ranges.  Of 
course things may be different practically speaking, because I also agree
that they are of lesser importance than literal ranges in most cases.

> Note that symbolic ranges are already restricted to PLUS_EXPR
> and MINUS_EXPR (and NEGATE_EXPR I think).  There are
> also "symbolic" (non-integer constant) ranges like [&a, &a].

Yes, the current implementation is restricted to additive operations.

-- 
Eric Botcazou


gcc-7-20190523 is now available

2019-05-23 Thread gccadmin
Snapshot gcc-7-20190523 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/7-20190523/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 7 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-7-branch 
revision 271584

You'll find:

 gcc-7-20190523.tar.xzComplete GCC

  SHA256=aefef236ccdaf8cc6b546eaa48f2d6f71351800a3e08d7654a964597da8057e5
  SHA1=1cfc74efabeee272d401b0f8d18e13050db53ea4

Diffs from 7-20190516 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-7
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.