[RFC] Assertions as optimization hints

2016-11-14 Thread Yuri Gribov
Hi all,

I've recently revisited an ancient patch from Paolo
(https://gcc.gnu.org/ml/gcc-patches/2004-04/msg00551.html) which uses
asserts as optimization hints. I've rewritten the patch to be more
robust in the presence of expressions with side effects and did some
basic investigation of its efficacy.

The optimization is hidden under !defined NDEBUG && defined
__ASSUME_ASSERTS__. The !NDEBUG part is necessary because assertions
often rely on special !NDEBUG-protected support code outside of the
assert itself (dedicated fields in structures and similar stuff,
collectively called "ghost variables"). __ASSUME_ASSERTS__ gives the
user a choice of whether to enable the optimization (it should probably
be hidden behind a friendly compiler switch, e.g. -fassume-asserts).
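
For readers who haven't opened the patch, the rough idea can be
sketched at the source level like this (illustration only --
assume_assert and the exact expansion are made up, and the naive form
below evaluates the expression twice, which is exactly the side-effect
problem the reworked patch takes care of):

  #include <cassert>

  #if !defined NDEBUG && defined __ASSUME_ASSERTS__
  /* Keep the normal run-time check, and additionally tell the
     optimizer that the condition holds on any path past the assert.  */
  # define assume_assert(expr) \
      do { assert (expr); if (!(expr)) __builtin_unreachable (); } while (0)
  #else
  # define assume_assert(expr) assert (expr)
  #endif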

I do not have access to a good machine for speed benchmarks, so I only
looked at size improvements in a few popular projects. There are no
revolutionary changes (0.1%-1%), but some functions see good reductions
which may result in noticeable runtime improvements in practice. One
good example is MariaDB, where you frequently find the following
pattern:
  struct A {
    virtual void foo() { assert(0); }
  };
  ...
  A *a;
  a->foo();
Here the patch will prevent GCC from inlining A::foo (as it will figure
out that the call can never occur at runtime), thus saving code size.

Does this approach make sense in general? If it does I can probably
come up with more measurements.

As a side note, at least some users may consider this a useful feature:
http://www.nntp.perl.org/group/perl.perl5.porters/2013/11/msg209482.html

-I


0001-Optionally-treat-assertions-as-optimization-hints-dr.patch
Description: Binary data


omp-low.c split

2016-11-14 Thread Cesar Philippidis
What's the plan to split omp-low.c into multiple files? Right now,
omp-low.c contains code to lower and expand OpenMP and OpenACC. At least
for the OpenACC transforms, we made an effort to keep the changes in
omp-low.c target-independent. Is the goal to break omp-low.c into separate
lowering, expansion and target/offloading-specific files?

Is there a timeline for it? The major pending OpenACC changes involve
the tile clause and routines. Most of the routine changes happen in the
FEs, but we do perform some error handling in omp-low.c.

Cesar



HSA/BRIG front-end

2016-11-14 Thread David Edelsohn
I am pleased to announce that the GCC Steering Committee has
accepted the HSA/BRIG front-end for inclusion in GCC and appointed
Pekka Jaaskelainen and Martin Jambor as co-maintainers.

Please join me in congratulating Pekka and Martin on their new roles.
Please update your listings in the MAINTAINERS file.

Happy hacking!
David



Re: omp-low.c split

2016-11-14 Thread Jakub Jelinek
On Mon, Nov 14, 2016 at 09:45:49AM -0800, Cesar Philippidis wrote:
> What's the plan to split omp-low.c into multiple files? Right now,
> omp-low.c contains code to lower and expand OpenMP and OpenACC. At least
> for the OpenACC transforms, we made an effort to keep the changes in
> omp-low.c target-independent. Is the goal to break omp-low.c into separate
> lowering, expansion and target/offloading-specific files?
> 
> Is there a timeline for it? The major pending OpenACC changes involve
> the tile clause and routines. Most of the routine changes happen in the
> FEs, but we do perform some error handling in omp-low.c.

We first need to have all the pending GCC 7 omp/oacc patches committed (and
reviewed if they haven't been yet).  That includes gomp-nvptx stuff
(approved), HSA stuff (I'll try to review tomorrow on Wed), OpenACC stuff
(the tile patchset is reviewed but will need small adjustments, the rest I
plan to review soon).
Only once all that is in can we consider the splitting.

Jakub


GCC 7.0.0 Status Report (2016-11-14), Stage 3 in effect now

2016-11-14 Thread Jakub Jelinek
Status
======

The trunk is in Stage 3 now, which means it is open for general
bugfixing.

Patches posted early enough during Stage 1 and not yet fully reviewed
may still get in early in Stage 3.  Please make sure to ping them
soon enough.


Quality Data
============

Priority          #   Change from last report
--------        ---   -----------------------
P1                4   +-   0
P2              114   -    3
P3              177   -   15
P4              109   +-   0
P5               29   +-   0
--------        ---   -----------------------
Total P1-P3     295   -   18
Total           433   -   18


Previous Report
===============

https://gcc.gnu.org/ml/gcc/2016-10/msg00176.html


Re: Suboptimal bb ordering with -Os on arm

2016-11-14 Thread Nicolai Stange
Hi Segher,

Segher Boessenkool  writes:

> On Fri, Nov 11, 2016 at 02:16:18AM +0100, Nicolai Stange wrote:
>> >> From the discussion on gcc-patches [1] of what is now the aforementioned
>> >> r228318 ("bb-reorder: Add -freorder-blocks-algorithm= and wire it up"),
>> >> it is not clear to me whether this change can actually reduce code size
>> >> beyond those 0.1% given there for -Os.
>> >
>> > There is r228692 as well.
>> 
>> Ok, summarizing, that changelog says that the simple algorithm
>> potentially produced even bigger code with -Os than stc did. From that
>> commit on, this remains true only on x86 and mn10300. Right?
>
> x86 and mn10300 use STC at -Os by default.

Ah! This explains why I didn't see such a BB ordering on x86.


>> Sure. Let me restate my original question: assume for a moment that it
>> is true that -Os with simple never produces code smaller than 0.1% of
>> what is created by -Os with stc. I haven't got any idea what the "other
>> things" are able to achieve w.r.t code size savings, but to me, 0.1%
>> doesn't appear to be that huge. Don't get me wrong: I *really* can't
>> judge on whether 0.1% is a significant improvement or not. I'm just
>> assuming that it's not. With this assumption, the question of whether
>> those saved 0.1% are really worth the significantly decreased
>> performance encountered in some situations seemed just natural...
>
> It all depends on the tradeoff you want.  There are many knobs you can
> turn -- for example the inlining params, that has quite some effect on
> code size.
>
> -Os is mostly -O2 except those things that increase code size.
>
> What is the tradeoff in your case?  What is a realistic number for the
> slowdown of your kernel?  Do you see hotspots in there that should be
> handled better anyhow?  Etc.

Well, I do see hotspots in my crappy RFC code not yet upstreamed, namely
those 0.5 us of extra latency on a memory-stressed system as initially
mentioned. The problem is, this is in interrupt context...

I'm quite certain that this is due to a mispredicted branch in
  for (...) {
    g();
  }
-- g() lives on another page and the TLB is cold.

However, once my test hardware is free again, I'll try to identify some
hotspots and get some numbers for a vanilla kernel.

For example, __hrtimer_run_queues() suffers from the same BB ordering,
but its code doesn't cross pages. I'd really have to measure it...


>> No, I want small, possibly at the cost of performance to the extent of
>> what's sensible. What sensible actually is is what my question is about.
>
> It is different for every use case I'm afraid.
>
>> So, summarizing, I'm not asking whether I should use -O2 or -Os or
>> whatever, but whether the current behaviour I'm seeing with -Os is
>> intended/expected quantitatively.
>
> With simple you get smaller code than with STC, so -Os uses simple.
> If that is ridiculously slower then you won't hear me complaining if
> you propose defaulting it the other way; but you haven't shown any
> convincing numbers yet?

Hmm, maybe there's a better way than changing the default.

If I read r228692 correctly, for the -Os case,
reorder_basic_blocks_simple() is made to retain the edge order as given
by the BB order (whatever this is?)  because this reduces code size
(because of reasons).

I think the gain in code size is due to favoring a larger total
fallthrough edge count -- fallthrough edges don't need jump insns in the
output. Is this correct?

However, I believe that this edge ordering can be relaxed: only a
certain type of edge must come before all the others.

The "all the others" can then be sorted by frequency again, thus leading
to better static branch prediction, especially in the loop case.

Thinking locally, if we have

  A
  |\
  | B
  |/
  C

we want to have the output order A, B, C, not A, C, B, because:
- A will have a single branch insn at its end either way,
- with the second ordering, B will need one, too.
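
In source terms the shape above is just an if without an else (a C-ish
sketch; do_b/do_c are of course made-up names):

  extern void do_b (void), do_c (void);

  void f (int cond)
  {
    if (cond)    /* A: ends in a single conditional branch            */
      do_b ();   /* B: with the A, B, C layout it falls through to C  */
    do_c ();     /* C */
  }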

Now, if B-C is considered first by the chain assembling part in
reorder_basic_blocks_simple(), this will produce the desired output.

I've appended a patch that does just this: it puts all the edges
originating from BBs with only one outgoing edge first and the rest,
sorted by frequency, after them.
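
Just to make the intended ordering rule concrete, here is a toy,
self-contained sketch of the comparison (toy_edge/edge_before/
order_edges are illustrative stand-ins only -- the actual patch of
course works on GCC's internal edge/basic_block structures; a stable
sort keeps the original order among the unconditional edges):

  #include <algorithm>
  #include <vector>

  struct toy_edge {
    int src_succ_count;  /* outgoing edges of the source block        */
    int frequency;       /* estimated execution frequency of the edge */
  };

  /* Edges whose source has a single successor come first (in their
     original order); the rest are ordered by decreasing frequency.  */
  static bool edge_before (const toy_edge &a, const toy_edge &b)
  {
    bool a_single = a.src_succ_count == 1;
    bool b_single = b.src_succ_count == 1;
    if (a_single != b_single)
      return a_single;
    if (a_single)
      return false;
    return a.frequency > b.frequency;
  }

  static void order_edges (std::vector<toy_edge> &edges)
  {
    std::stable_sort (edges.begin (), edges.end (), edge_before);
  }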

The results for that single test case I've tried, namely the kernel
with some config on ARM, look surprisingly good:

W/o patch:
  0 .head.text026c  c0008000  c0008000  8000  2**2
  1 .text 002ab370  c010  c010  0001  2**5
 16 .init.text00027270  c06002e0  c06002e0  003f02e0  2**5
 17 .exit.text0594  c0627550  c0627550  00417550  2**2

W/ patch:
  0 .head.text026c  c0008000  c0008000  8000  2**2
  1 .text 002aaeac  c010  c010  0001  2**5
 16 .init.text0002723c  c06002e0  c06002e0  003f02e0  2**5
 17 .exit.text0594  c062751c  c062751c  0041751c  2**2


So even slightly smaller code is produced. But more importantly, it gets
the fallthrough for the loops