Re: RFC: Improving GCC8 default option settings

2017-09-13 Thread Janne Blomqvist
On Tue, Sep 12, 2017 at 4:57 PM, Wilco Dijkstra  wrote:
> Hi all,
>
> At the GNU Cauldron I was inspired by several interesting talks about 
> improving
> GCC in various ways. While GCC has many great optimizations, a common theme is
> that its default settings are rather conservative. As a result users are
> required to enable several additional optimizations by hand to get good code.
> Other compilers enable more optimizations at -O2 (loop unrolling in LLVM was
> mentioned repeatedly) which GCC could/should do as well.
>
> Here are a few concrete proposals to improve GCC's option settings which will
> enable better code generation for most targets:
>
> * Make -fno-math-errno the default - this mostly affects the code generated
>   for sqrt, which should be treated just like floating point division and not
>   set errno by default (unless you explicitly select C89 mode).

+1. Math functions setting errno is a blast from the past that needs
to die. That being said, this does to some extent depend on libm so
perhaps the default needs to be target-dependent.

> * Make -fno-trapping-math the default - another obvious one. From the docs:
>   "Compile code assuming that floating-point operations cannot generate
>user-visible traps."
>   There isn't a lot of code that actually uses user-visible traps (if any -
>   many CPUs don't even support user traps as it's an optional IEEE feature).
>   So assuming trapping math by default is way too conservative since there is
>   no obvious benefit to users.

As Mr. Myers explains, this is probably going a bit too far. I think
by default whatever fp optimizations are allowed with FENV_ACCESS off
is reasonable.

> * Make -fomit-frame-pointer the default - various targets already do this at
>   higher optimization levels, but this could easily be done for all targets.
>   Frame pointers haven't been needed for debugging for decades, however if
>   there are still good reasons to keep it enabled with -O0 or -O1 (I can't
>   think of any unless it is for last-resort backtrace when there is no
>   unwind info at a crash), we could just disable the frame pointer from -O2
>   onwards.

Sounds reasonable.

> These are just a few ideas to start. What do people think? I'd welcome
> discussion and other proposals for similar improvements.

What about the default behavior if no options are given? I think a
more reasonable default would be something roughly like

-O2 -Wall

or if debuggability is considered more important than speed & size, maybe

-Og -g -Wall

-- 
Janne Blomqvist


Re: Invalid free in standard library in trivial example with C++17 on gcc 7.2

2017-09-13 Thread Jonathan Wakely
On 12 September 2017 at 23:58, Dave Gittins wrote:
> I confirmed this issue on x86_64 CentOS, and independently here:
> https://wandbox.org/permlink/ncWqA9Zu3YEofqri
>
> Also fails on gcc trunk.
>
> Possibly related to bug 81338 "stringstream remains empty after being
> moved into multiple times"? Although I see that one is fixed by Mr
> Wakely.

That only affected the new ABI, and was present in all modes not just C++17.


Re: Byte swapping support

2017-09-13 Thread Jürg Billeter
On Tue, 2017-09-12 at 21:46 +0200, Eric Botcazou wrote:
> > In contrast to the existing scalar storage order support for structs, the
> > goal is to reverse the storage order for all memory operations to achieve
> > maximum compatibility with the behavior on big-endian systems, as far as
> > observable by the application.
> 
> I presume that you're well aware of this, but you cannot just reverse the
> storage order for any memory operation; for example, an array of 4 chars in C
> is stored the same way in big-endian and little-endian order, so you ought not
> to do byte swapping when you access it as a whole.  So the above sentence must
> be read as "to reverse the storage order for all scalar memory operations".

Yes, I'm aware of this.  Thanks for stating this more clearly.

> When the scalar_storage_order attribute was designed, discussions led to the
> conclusion that doing the swapping for any scalar memory operation, as
> opposed to any access to a scalar within a structure, would not be a
> significant enough step forward to warrant the significantly more complex
> implementation (or the big performance penalty if you do things very
> roughly).

Was this considered significantly more complex because of the need to
discriminate between native and reverse order? Or do you expect similar
complexity even if this is not required (see my comment below)?

> > The plan is to insert byte swapping instructions as part of the RTL
> > expansion of GIMPLE assignments that access memory. This would leverage
> > code that was added for -fsso-struct, keeping the code simple and
> > maintainable.
> 
> How do you discriminate scalars stored in native order and scalars stored
> in reverse order though?  That's the main difficulty of the implementation.

I don't. The idea is to reverse scalar storage order for the whole
userspace process and then add byte swapping to the Linux kernel when
accessing userspace memory. This keeps userspace memory consistent
with regards to endianness, which should lead to high compatibility
with big-endian applications. Userspace memory access from the kernel
always uses a small set of helper functions, which should make it
easier to insert byte swapping at appropriate places.

Jürg


Re: Byte swapping support

2017-09-13 Thread Jürg Billeter
On Tue, 2017-09-12 at 08:52 -0700, H.J. Lu wrote:
> Can you use __attribute__ ((scalar_storage_order)) in GCC 7?

To support existing large code bases, the goal is to reverse storage
order for all scalars, not just (selected) structs/unions. Also need to
support taking the address of a scalar field, for example. C++ support
will be required as well.

Jürg


Re: Byte swapping support

2017-09-13 Thread Paul.Koning

> On Sep 13, 2017, at 5:51 AM, Jürg Billeter  
> wrote:
> 
> On Tue, 2017-09-12 at 08:52 -0700, H.J. Lu wrote:
>> Can you use __attribute__ ((scalar_storage_order)) in GCC 7?
> 
> To support existing large code bases, the goal is to reverse storage
> order for all scalars, not just (selected) structs/unions. Also need to
> support taking the address of a scalar field, for example. C++ support
> will be required as well.

I wonder about that.  It's inefficient to do byte swapping on local data; it is 
only useful and needed on external data.  Data that goes to files for use by 
other-byte-order applications, or data that goes across a bus or network to 
consumers that have the other byte order.  A byte swapped local variable only 
consumes cycles and instruction space to no purpose.

My experience is that byte order marking of data is very useful, but it always 
is applied selectively just to those spots where it is needed.

paul


Re: How to configure a bi-arch PowerPC GCC?

2017-09-13 Thread Andreas Schwab
On Jul 20 2017, Sebastian Huber  wrote:

> Ok, so why do I get a "error: unrecognizable insn:"? How can I debug a
> message like this:
>
> (insn 12 11 13 2 (set (reg:CCFP 126)
> (compare:CCFP (reg:TF 123)
> (reg:TF 124))) "test-v0.i":5 -1
>  (nil))

This is supposed to be matched by the cmptf_internal1 pattern with
-mabi=ibmlongdouble.  Looks like your configuration defaults to
-mabi=ieeelongdouble.

Andreas.

-- 
Andreas Schwab, SUSE Labs, sch...@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."


Re: Byte swapping support

2017-09-13 Thread Eric Botcazou
> Was this considered significantly more complex because of the need to
> discriminate between native and reverse order? Or do you expect similar
> complexity even if this is not required (see my comment below)?

The former.

> I don't. The idea is to reverse scalar storage order for the whole
> userspace process and then add byte swapping to the Linux kernel when
> accessing userspace memory. This keeps userspace memory consistent
> with regards to endianness, which should lead to high compatibility
> with big-endian applications. Userspace memory access from the kernel
> always uses a small set of helper functions, which should make it
> easier to insert byte swapping at appropriate places.

Well, if your userspace is entirely in reverse order, then of course things
are totally different and I suspect that you'll pay the price in terms of
run-time performance.  This is not what the attribute was designed for,
although we added the -fsso-struct switch at some point.

-- 
Eric Botcazou


Re: RFC: Improving GCC8 default option settings

2017-09-13 Thread Richard Biener
On Wed, Sep 13, 2017 at 3:21 AM, Michael Clark  wrote:
>
>> On 13 Sep 2017, at 1:15 PM, Michael Clark  wrote:
>>
>> - https://rv8.io/bench#optimisation
>> - https://rv8.io/bench#executable-file-sizes
>>
>> -O2 is 98% perf of -O3 on x86-64
>> -Os is 81% perf of -O3 on x86-64
>>
>> -O2 saves 5% space on -O3 on x86-64
>> -Os saves 8% space on -Os on x86-64
>>
>> 17% drop in performance for 3% saving in space is not a good trade for a 
>> “general” size optimisation. It’s more like executable compression.
>
> Sorry fixed typo:
>
> -O2 is 98% perf of -O3 on x86-64
> -Os is 81% perf of -O3 on x86-64
>
> -O2 saves 5% space on -O3 on x86-64
> -Os saves 8% space on -O3 on x86-64
>
> The extra ~3% space saving for ~17% drop in performance doesn’t seem like a 
> good general option for size based on the cost in performance.
>
> Again. I really like GCC’s -O2 and hope that its binaries don’t grow in size 
> nor slow down.

I think with GCC -Os and -O2 are essentially the same with the
difference that -Os assumes regions are cold and thus to be
optimized for size and -O2 assumes they are hot and thus to be
optimized for speed, in cases where there is no heuristic proving
otherwise.  I know this doesn't 100% reflect implementation reality
but it should be close.

IMHO we should turn on flags we turn on with -fprofile-use and have
some more nuances in optimize_*_for_{speed,size} as we
now track profile quality more closely.

I see -O1 as mostly worthless unless you are compiling
machine-generated code that makes -O2+ go OOM/time.  Apart
from avoiding quadratic or worse algorithms -O1 sees no love.

On its own -O3 doesn't add much (some loop opts and slightly more
aggressive inlining/unrolling), so whatever it does we
should consider doing at -O2 eventually.

Richard.


Re: RFC: Improving GCC8 default option settings

2017-09-13 Thread Kevin André
On Wed, Sep 13, 2017 at 9:43 AM, Janne Blomqvist
 wrote:
> On Tue, Sep 12, 2017 at 4:57 PM, Wilco Dijkstra  
> wrote:
>> These are just a few ideas to start. What do people think? I'd welcome
>> discussion and other proposals for similar improvements.
>
> What about the default behavior if no options are given? I think a
> more reasonable default would be something roughly like
>
> -O2 -Wall
>
> or if debuggability is considered more important than speed & size, maybe
>
> -Og -g -Wall

Enabling (some) warnings by default seems reasonable to me. Not sure
about the rest though.

This is something people can't seem to agree on. Some like warnings by
default, some like optimizations by default. Some are against warnings
by default, arguing that people like distro-builders have no need for
warnings (for example). Some are against optimizations by default,
because it would make compilation slower when they just want to check
if some piece of code compiles successfully or not (for example).

I think the only way to decide what options to enable by default is to
first decide who your target audience is going to be. Who do you
expect to run gcc with no options specified? If you ask me, that will
mostly be students or beginners. Those who are experienced are more
likely to use an automated build system where they specify build
options only once and then forget about it. An inexperienced person
would need warnings and debug info by default, and maybe a few simple
optimizations that do not interfere with debugging. There can only be
one set of default options, and which one you pick will depend on who
your target audience is. You cannot please everyone.

-- 
Kevin


Re: Byte swapping support

2017-09-13 Thread Jürg Billeter
On Wed, 2017-09-13 at 13:08 +, paul.kon...@dell.com wrote:
> > On Sep 13, 2017, at 5:51 AM, Jürg Billeter  
> > wrote:
> > To support existing large code bases, the goal is to reverse storage
> > order for all scalars, not just (selected) structs/unions. Also need to
> > support taking the address of a scalar field, for example. C++ support
> > will be required as well.
> 
> I wonder about that.  It's inefficient to do byte swapping on local
> data; it is only useful and needed on external data.  Data that goes
> to files for use by other-byte-order applications, or data that goes
> across a bus or network to consumers that have the other byte
> order.  A byte swapped local variable only consumes cycles and
> instruction space to no purpose.

Existing code that has been written under the assumption of big-endian
memory layout can break on little-endian systems even with just local
data (e.g., accessing individual bytes of a 32-bit scalar). It would
obviously be best to fix all such code; however, that's not always
feasible, unfortunately.

Avoiding byte swapping for spilled registers where storage order is
(normally) not observable by applications will hopefully reduce the
performance overhead.

Jürg


Re: RFC: Improving GCC8 default option settings

2017-09-13 Thread Jakub Jelinek
On Wed, Sep 13, 2017 at 03:41:19PM +0200, Richard Biener wrote:
> On its own -O3 doesn't add much (some loop opts and slightly more
> aggressive inlining/unrolling), so whatever it does we
> should consider doing at -O2 eventually.

Well, -O3 adds vectorization, which we don't enable at -O2 by default.

Jakub


Re: RFC: Improving GCC8 default option settings

2017-09-13 Thread Richard Biener
On Wed, Sep 13, 2017 at 3:46 PM, Jakub Jelinek  wrote:
> On Wed, Sep 13, 2017 at 03:41:19PM +0200, Richard Biener wrote:
>> On its own -O3 doesn't add much (some loop opts and slightly more
>> aggressive inlining/unrolling), so whatever it does we
>> should consider doing at -O2 eventually.
>
> Well, -O3 adds vectorization, which we don't enable at -O2 by default.

As said, -fprofile-use enables it so -O2 should eventually do the same
for "really hot code".

Richard.

> Jakub


Re: RFC: Improving GCC8 default option settings

2017-09-13 Thread Allan Sandfeld Jensen
On Tuesday, 12 September 2017 23:27:22 CEST Michael Clark wrote:
> > On 13 Sep 2017, at 1:57 AM, Wilco Dijkstra  wrote:
> > 
> > Hi all,
> > 
> > At the GNU Cauldron I was inspired by several interesting talks about
> > improving GCC in various ways. While GCC has many great optimizations, a
> > common theme is that its default settings are rather conservative. As a
> > result users are required to enable several additional optimizations by
> > hand to get good code. Other compilers enable more optimizations at -O2
> > (loop unrolling in LLVM was mentioned repeatedly) which GCC could/should
> > do as well.
> 
> There are some nuances to -O2. Please consider -O2 users who wish to use it
> like Clang/LLVM’s -Os (-O2 without loop vectorisation IIRC).
> 
> Clang/LLVM has an -Os that is like -O2 so adding optimisations that increase
> code size can be skipped from -Os without drastically affecting
> performance.
> 
> This is not the case with GCC, where -Os is a size-at-all-costs optimisation
> mode. GCC users’ option for size, but not at the expense of speed, is to use -O2.
> 
> Clang GCC
> -Oz   ~=  -Os
> -Os   ~=  -O2
> 
No. Clang's -Os is somewhat limited compared to gcc's, just like the clang -Og 
is just -O1. AFAIK -Oz is a proprietary Apple clang parameter, and not in 
clang proper.

'Allan


Re: RFC: Improving GCC8 default option settings

2017-09-13 Thread Allan Sandfeld Jensen
On Wednesday, 13 September 2017 15:46:09 CEST Jakub Jelinek wrote:
> On Wed, Sep 13, 2017 at 03:41:19PM +0200, Richard Biener wrote:
> > On its own -O3 doesn't add much (some loop opts and slightly more
> > aggressive inlining/unrolling), so whatever it does we
> > should consider doing at -O2 eventually.
> 
> Well, -O3 adds vectorization, which we don't enable at -O2 by default.
> 
Would it be possible to enable basic block vectorization on -O2? I assume that 
doesn't increase binary size since it doesn't unroll loops.

'Allan



Re: RFC: Improving GCC8 default option settings

2017-09-13 Thread Jan Hubicka
> On Wed, Sep 13, 2017 at 3:46 PM, Jakub Jelinek  wrote:
> > On Wed, Sep 13, 2017 at 03:41:19PM +0200, Richard Biener wrote:
> >> On its own -O3 doesn't add much (some loop opts and slightly more
> >> aggressive inlining/unrolling), so whatever it does we
> >> should consider doing at -O2 eventually.
> >
> > Well, -O3 adds vectorization, which we don't enable at -O2 by default.
> 
> As said, -fprofile-use enables it so -O2 should eventually do the same
> for "really hot code".

I don't see static profile prediction as very useful here to find "really
hot code" - neither in the current implementation nor in the future. The
problem of -O2 is that we kind of know that only 10% of code somewhere
matters for performance but we have no way to reliably identify it.

It would make sense to have less aggressive vectorization at -O2 and more at
-Ofast/-O3.

Adding -Os and -Oz would make sense to me - even with hot/cold info it is not
desirable to optimize as aggressively for size as we do, because mistakes
happen and one does not want to make code paths 1000 times slower to save one
byte of binary.

We could handle this gracefully internally by having logic for "known to be
cold" and "guessed to be cold". New profile code can make a difference in this.

Honza
> 
> Richard.
> 
> > Jakub


Re: RFC: Improving GCC8 default option settings

2017-09-13 Thread Jan Hubicka
> On Wed, Sep 13, 2017 at 3:21 AM, Michael Clark  wrote:
> >
> >> On 13 Sep 2017, at 1:15 PM, Michael Clark  wrote:
> >>
> >> - https://rv8.io/bench#optimisation
> >> - https://rv8.io/bench#executable-file-sizes
> >>
> >> -O2 is 98% perf of -O3 on x86-64
> >> -Os is 81% perf of -O3 on x86-64
> >>
> >> -O2 saves 5% space on -O3 on x86-64
> >> -Os saves 8% space on -Os on x86-64
> >>
> >> 17% drop in performance for 3% saving in space is not a good trade for a 
> >> “general” size optimisation. It’s more like executable compression.
> >
> > Sorry fixed typo:
> >
> > -O2 is 98% perf of -O3 on x86-64
> > -Os is 81% perf of -O3 on x86-64
> >
> > -O2 saves 5% space on -O3 on x86-64
> > -Os saves 8% space on -O3 on x86-64

I am a bit surprised you see only an 8% code size difference
between -Os and -O3.  I look into these numbers occasionally and
it is usually well into double digits.
http://hubicka.blogspot.cz/2014/04/linktime-optimization-in-gcc-2-firefox.html
http://hubicka.blogspot.cz/2014/09/linktime-optimization-in-gcc-part-3.html
show a 42% code segment size reduction for Firefox and 19% for LibreOffice.

Honza


Re: RFC: Improving GCC8 default option settings

2017-09-13 Thread Nikos Chantziaras

On 12/09/17 16:57, Wilco Dijkstra wrote:

[...] As a result users are
required to enable several additional optimizations by hand to get good code.
Other compilers enable more optimizations at -O2 (loop unrolling in LLVM was
mentioned repeatedly) which GCC could/should do as well.
[...]

I'd welcome discussion and other proposals for similar improvements.


What's the status of graphite? It's been around for years. Isn't it 
mature enough to enable these:


-floop-interchange -ftree-loop-distribution -floop-strip-mine -floop-block

by default for -O2? (And I'm not even sure those are the complete set of 
graphite optimization flags, or just the "useful" ones.)




Re: RFC: Improving GCC8 default option settings

2017-09-13 Thread Richard Biener
On September 13, 2017 5:35:11 PM GMT+02:00, Jan Hubicka  wrote:
>> On Wed, Sep 13, 2017 at 3:46 PM, Jakub Jelinek wrote:
>> > On Wed, Sep 13, 2017 at 03:41:19PM +0200, Richard Biener wrote:
>> >> On its own -O3 doesn't add much (some loop opts and slightly more
>> >> aggressive inlining/unrolling), so whatever it does we
>> >> should consider doing at -O2 eventually.
>> >
>> > Well, -O3 adds vectorization, which we don't enable at -O2 by default.
>> 
>> As said, -fprofile-use enables it so -O2 should eventually do the same
>> for "really hot code".
>
>I don't see static profile prediction as very useful here to find
>"really hot code" - neither in the current implementation nor in the
>future. The problem of -O2 is that we kind of know that only 10% of
>code somewhere matters for performance but we have no way to reliably
>identify it.

It's hard to do better than statically look at (ipa) loop depth. But shouldn't 
that be good enough? 

>
>It would make sense to have less aggressive vectorization at -O2 and
>more at -Ofast/-O3.

We tried that but the runtime effects were not offsetting the compile time 
cost. 

>Adding -Os and -Oz would make sense to me - even with hot/cold info it
>is not desirable to optimize as aggressively for size as we do, because
>mistakes happen and one does not want to make code paths 1000 times
>slower to save one byte of binary.
>
>We could handle this gracefully internally by having logic for "known
>to be cold" and "guessed to be cold". New profile code can make a
>difference in this.
>
>Honza
>> 
>> Richard.
>> 
>> > Jakub



Re: RFC: Improving GCC8 default option settings

2017-09-13 Thread Jan Hubicka
> >I don't see static profile prediction as very useful here to find
> >"really hot code" - neither in the current implementation nor in the
> >future. The problem of -O2 is that we kind of know that only 10% of
> >code somewhere matters for performance but we have no way to reliably
> >identify it.
> 
> It's hard to do better than statically look at (ipa) loop depth. But 
> shouldn't that be good enough? 

Only if you assume that you have the whole program and understand indirect
calls.  There are some stats on this here:
http://ieeexplore.ieee.org/document/717399/

It shows that propagating a static profile across the whole program (which is
just a tiny bit more fancy than counting loop depth) sort of works
statistically.  I really do not have very high hopes of this reliably working
in a production compiler.  We already have PRs for single-function benchmarks
where a deep loop nest is used in initialization or so and the actual
hard-working part has a small loop nest & gets identified as cold.

As soon as you start propagating in a whole-program context, such local
mistakes will become more common.
> 
> >
> >It would make sense to have less aggressive vectorization at -O2 and
> >more at -Ofast/-O3.
> 
> We tried that but the runtime effects were not offsetting the compile time 
> cost. 

Yep, i remember that.

Honza


Re: RFC: Improving GCC8 default option settings

2017-09-13 Thread Richard Biener
On September 13, 2017 6:24:21 PM GMT+02:00, Jan Hubicka  wrote:
>> >I don't see static profile prediction as very useful here to find
>> >"really hot code" - neither in the current implementation nor in the
>> >future. The problem of -O2 is that we kind of know that only 10% of
>> >code somewhere matters for performance but we have no way to reliably
>> >identify it.
>> 
>> It's hard to do better than statically look at (ipa) loop depth. But
>shouldn't that be good enough? 
>
>Only if you assume that you have the whole program and understand
>indirect calls.  There are some stats on this here:
>http://ieeexplore.ieee.org/document/717399/
>
>It shows that propagating a static profile across the whole program
>(which is just a tiny bit more fancy than counting loop depth) sort of
>works statistically.  I really do not have very high hopes of this
>reliably working in a production compiler.  We already have PRs for
>single-function benchmarks where a deep loop nest is used in
>initialization or so and the actual hard-working part has a small loop
>nest & gets identified as cold.
>
>As soon as you start propagating in a whole-program context, such local
>mistakes will become more common.

Heh, I would just make loop nests hot without globally making anything cold
because of that. Basically something like optimistic IPA profile propagation.

Richard. 

>> 
>> >
>> >It would make sense to have less aggressive vectorization at -O2 and
>> >more at -Ofast/-O3.
>> 
>> We tried that but the runtime effects were not offsetting the compile
>time cost. 
>
>Yep, i remember that.
>
>Honza



Re: Byte swapping support

2017-09-13 Thread Jim Wilson

On 09/12/2017 02:32 AM, Jürg Billeter wrote:

To support applications that assume big-endian memory layout on little-
endian systems, I'm considering adding support for reversing the
storage order to GCC. In contrast to the existing scalar storage order
support for structs, the goal is to reverse the storage order for all
memory operations to achieve maximum compatibility with the behavior on
big-endian systems, as far as observable by the application.


Intel has support for this in icc.  It took about 5 years for a small 
team to make it work on a very large application.  That includes both 
the compiler development and application development time.  There are a 
lot of complicated issues that need to be solved to make this work on real 
code, both in the compiler and in the application code.  There is a Dr 
Dobbs article about some of it, search for "Writing a Bi-Endian 
Compiler" if you are interested.


Even though they got it working, it was painful to use.  Icc goes to a 
lot of trouble to optimize away unnecessary byte-swapping to improve 
performance, but that meant any variable could be big or little endian 
regardless of how it was declared, and could be a different endianness at 
different places in the code, and could even be both endianness (stored 
in two locations) at the same time if the code needed both endianness. 
Sometimes we'd find a bug, and it would take a week to figure out if it 
was a compiler bug or an application bug.



To facilitate byte swapping at endian boundaries (kernel or libraries),
I'm also considering developing a new GCC builtin that can byte-swap
whole structs in memory. There are limitations to this, e.g., unions
could not be supported in general. However, I still expect this to be
very useful.


There is a lot more stuff that will cause problems.  Byte-swapping FP 
doesn't make sense.  You can only byte swap a variable if you know its 
type, but you don't know the type of a va_list ap argument, so you can't 
call a big-endian vprintf from little-endian code and vice versa.  If 
you have a template expanded in both big and little endian code, you 
will run into problems unless name mangling changes to include endian 
info, which means you lose ABI compatibility with the current name 
mangling scheme.


There will also be trouble with variables in shared libraries that get 
initialized by the dynamic linker.  You will either have to add a new 
set of other-endian relocations, or else you will have to add code to 
byte-swap data after relocations are performed, probably via an init 
routine, which will have to run before the other init routines.  There 
is also the same issue with static linking, but that one is a little 
easier to handle, as you can use a post-linking pass to edit the binary 
and byte swap stuff that needs to be byte swapped after relocations are 
performed.


To handle endian boundaries, you will need to force all declarations to 
have an endianness, and you will need to convert when calling a 
big-endian function from a little-endian function, and vice versa, and 
you will need to give an error if you see something you can't convert, 
like a va_list argument.  Besides the issue of the C library not 
changing endianness, you will likely also have third party libraries 
that you can't change the endianness of, and that need to be linked into 
your application.


Before you start, you should give some thought to how debugging will 
work.  DWARF does have an endianity attribute, you will need to set it 
correctly, or debugging will be hopeless.  Even if you set it correctly, 
if you have optimizations to remove unnecessary byte swapping, debugging 
optimized code will still be hard and people using the compiler will 
have to be trained on how to deal with endianness issues.


And there are lots of other problems, I don't have time to document them 
all, or even remember them all.  Personally, I think you are better off 
trying to fix the application to make it more portable.  Fixing the 
compiler is not a magic solution to the problem that is any easier than 
fixing the application.


Jim


Re: Byte swapping support

2017-09-13 Thread Joseph Myers
On Wed, 13 Sep 2017, Jim Wilson wrote:

> Intel has support for this in icc.  It took about 5 years for a small team to

And allegedly has patents in this area.

-- 
Joseph S. Myers
jos...@codesourcery.com


gcc-6-20170913 is now available

2017-09-13 Thread gccadmin
Snapshot gcc-6-20170913 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/6-20170913/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 6 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-6-branch 
revision 252740

You'll find:

 gcc-6-20170913.tar.xzComplete GCC

  SHA256=03edcdafdbb456ef1d12b466623b4a308ff7f41385f249b5ff7444dc2f838d3a
  SHA1=9cddeab7feb967f2f402f7f3c8b3a67a27e9a588

Diffs from 6-20170906 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-6
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.


Re: Byte swapping support

2017-09-13 Thread Andi Kleen
Jürg Billeter  writes:
>
> I don't. The idea is to reverse scalar storage order for the whole
> userspace process and then add byte swapping to the Linux kernel when
> accessing userspace memory. This keeps userspace memory consistent
> with regards to endianness, which should lead to high compatibility
> with big-endian applications. Userspace memory access from the kernel
> always uses a small set of helper functions, which should make it
> easier to insert byte swapping at appropriate places.

I expect you'll find that it isn't that easy. There are a lot of opaque
copies that copy whole structures.

You'll need a whole compat layer, similar to 32<->64bit compat layers,
but actually doing more work because it has to handle all fields larger
than one byte, not just fields which differ in size.

-Andi


Re: Bit-field struct member sign extension pattern results in redundant

2017-09-13 Thread Michael Clark

> On 5 Sep 2017, at 9:35 AM, Michael Clark  wrote:
> 
>> 
>> On 19 Aug 2017, at 4:10 AM, Richard Henderson  wrote:
>> 
>> On 08/17/2017 03:29 PM, Michael Clark wrote:
>>> hand coded x86 asm (no worse because the sar depends on the lea)
>>> 
>>> sx5(int):
>>>   shl edi, 27
>>>   sar edi, 27
>>>   movsx eax, dl
>> 
>> Typo in the register, but I know what you mean.  More interestingly, edi
>> already has the sign-extended value, so "mov eax, edi" suffices (saving one
>> byte or perhaps allowing better register allocation).
>> 
>> That said, if anyone is tweaking x86-64 patterns on this, consider
>> 
>> sx5(int):
>>  rorxeax, edi, 5
>>  sar eax, 27
>> 
>> where the (newish) bmi2 rorx instruction allows the data to be moved into 
>> place
>> while rotating, avoiding the extra move entirely.
> 
> The patterns might be hard to match unless we can get the gimple expansions 
> for bitfield access to use the mode of the enclosing type.
> 
> For bitfield accesses to SI mode fields under 8 bits on RISC-V, gcc is 
> generating two shifts on QI mode sub-registers, each with a sign-extend.
> 
> For bitfield accesses to SI mode fields under 16 bits on RISC-V, gcc is 
> generating two shifts on HI mode sub-registers, each with a sign-extend.
> 
> The bitfield expansion logic is selecting the narrowest type that can hold 
> the bitwidth of the field, versus the bitwidth of the enclosing type and this 
> appears to be in the gimple to rtl transform, possibly in expr.c.
> 
> Using the mode of the enclosing type for bitfield member access expansion 
> would likely solve this problem. I suspect that with the use of caches and 
> word-sized register accesses, this sort of change would be reasonable to 
> make for 32-bit and 64-bit targets.
> 
> I also noticed something I didn’t spot earlier. On RV32 the sign extend and 
> shifts are correctly coalesced into 27 bit shifts in the combine stage. We 
> are presently looking into another redundant sign extension issue on RV64 
> that could potentially be related. It could be that the shift coalescing 
> optimisation doesn’t happen unless the redundant sign extensions are 
> eliminated early in combine by simplify_rtx. i.e. the pattern is more complex 
> due to sign_extend ops that are not eliminated.
> 
> - https://cx.rv8.io/g/2FxpNw
> 
> RV64 and Aarch64 both appear to have the issue but with different expansions 
> for the shift and extend pattern due to the mode of access (QI or HI). Field 
> accesses above 16 bits create SI mode accesses and generate the expected 
> code. The RISC-V compiler has the #define SLOW_BYTE_ACCESS 1 patch; however, it 
> appears to make no difference in this case. SLOW_BYTE_ACCESS suppresses QI 
> mode and HI mode loads in some bitfield test cases when a struct is passed by 
> pointer, but it has no effect on this particular issue. The following link shows 
> the codegen for the fix to the SLOW_BYTE_ACCESS issue, i.e. proof that the 
> compiler has SLOW_BYTE_ACCESS defined to 1.
> 
> - https://cx.rv8.io/g/TyXnoG
> 
> A target independent fix that would solve the issue on ARM and RISC-V would 
> be to access bitfield members with the mode of the bitfield member's 
> enclosing type instead of the smallest mode that can hold the bitwidth of the 
> type. If we had a char or short member in the struct, I can understand the 
> use of QI and HI mode, as we would need narrower accesses due to alignment 
> issues, but in this case the member is an int, so one would expect this to 
> expand to SI mode accesses since the enclosing type is SI mode.

It appears that on 64-bit targets, promoting the shift operand to an SI mode 
subregister of a DI mode register prevents combine from coalescing the shift and 
sign_extend. The only difference I can spot between RV32, which coalesces the 
shift during combine, and RV64 and other 64-bit targets, which do not, is that 
the 64-bit targets have promoted the shift operand to an SI mode subregister of 
a DI mode register, whereas on RV32 the shift operand is simply an SI mode 
register. I suspect there is some code in simplify-rtx.c that is missing the 
subregister case; however, I’m still trying to isolate the code that coalesces 
shift and sign_extend.

I’ve found a similar, perhaps smaller, but related case where a shift is not 
coalesced; in this case, however, it is also not coalesced on RV32:

https://cx.rv8.io/g/fPdk2F

> riscv64:
> 
>   sx5(int):
> slliw a0,a0,3
> slliw a0,a0,24
> sraiw a0,a0,24
> sraiw a0,a0,3
> ret
> 
>   sx17(int):
> slliw a0,a0,15
> sraiw a0,a0,15
> ret
> 
> riscv32:
> 
>   sx5(int):
> slli a0,a0,27
> srai a0,a0,27
> ret
> 
>   sx17(int):
> slli a0,a0,15
> srai a0,a0,15
> ret
> 
> aarch64:
> 
>   sx5(int):
> sbfiz w0, w0, 3, 5
> asr w0, w0, 3
> ret
> 
>   sx17(int):
> sbfx w0, w0, 0, 17
> ret

Re: Byte swapping support

2017-09-13 Thread Eric Botcazou
> And there are lots of other problems, I don't have time to document them
> all, or even remember them all.  Personally, I think you are better off
> trying to fix the application to make it more portable.  Fixing the
> compiler is not a magic solution to the problem that is any easier than
> fixing the application.

Note that WRS' Diab compiler has got something equivalent to what GCC has got 
now, i.e. a way to tag a particular component in a structure as BE or LE.
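The GCC facility referred to here is the scalar_storage_order type attribute (GCC 6 and later; clang does not support it). A minimal example:

```c
/* GCC 6+: scalar fields of this struct are stored big-endian regardless
   of the target's native byte order; loads and stores are byte-swapped
   automatically on little-endian targets. */
struct __attribute__((scalar_storage_order("big-endian"))) be_header {
    unsigned int   magic;
    unsigned short version;
};
```

Note that GCC diagnoses taking the address of a scalar field whose storage order is reversed relative to the target, which limits how such structs can be passed around.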

-- 
Eric Botcazou