Re: RFC: Improving GCC8 default option settings
On Tue, Sep 12, 2017 at 4:57 PM, Wilco Dijkstra wrote:
> Hi all,
>
> At the GNU Cauldron I was inspired by several interesting talks about improving
> GCC in various ways. While GCC has many great optimizations, a common theme is
> that its default settings are rather conservative. As a result users are
> required to enable several additional optimizations by hand to get good code.
> Other compilers enable more optimizations at -O2 (loop unrolling in LLVM was
> mentioned repeatedly) which GCC could/should do as well.
>
> Here are a few concrete proposals to improve GCC's option settings which will
> enable better code generation for most targets:
>
> * Make -fno-math-errno the default - this mostly affects the code generated for
>   sqrt, which should be treated just like floating point division and not set
>   errno by default (unless you explicitly select C89 mode).

+1. Math functions setting errno is a blast from the past that needs to die.

That being said, this does to some extent depend on libm, so perhaps the
default needs to be target-dependent.

> * Make -fno-trapping-math the default - another obvious one. From the docs:
>   "Compile code assuming that floating-point operations cannot generate
>   user-visible traps."
>   There isn't a lot of code that actually uses user-visible traps (if any -
>   many CPUs don't even support user traps as it's an optional IEEE feature).
>   So assuming trapping math by default is way too conservative since there is
>   no obvious benefit to users.

As Mr. Myers explains, this is probably going a bit too far. I think by
default whatever fp optimizations are allowed with FENV_ACCESS off is
reasonable.

> * Make -fomit-frame-pointer the default - various targets already do this at
>   higher optimization levels, but this could easily be done for all targets.
>   Frame pointers haven't been needed for debugging for decades, however if
>   there are still good reasons to keep it enabled with -O0 or -O1 (I can't
>   think of any unless it is for last-resort backtrace when there is no unwind
>   info at a crash), we could just disable the frame pointer from -O2 onwards.

Sounds reasonable.

> These are just a few ideas to start. What do people think? I'd welcome
> discussion and other proposals for similar improvements.

What about the default behavior if no options are given? I think a more
reasonable default would be something roughly like

-O2 -Wall

or if debuggability is considered more important than speed & size, maybe

-Og -g -Wall

-- 
Janne Blomqvist
Re: Invalid free in standard library in trivial example with C++17 on gcc 7.2
On 12 September 2017 at 23:58, Dave Gittins wrote:
> I confirmed this issue on x86_64 CentOS, and independently here:
> https://wandbox.org/permlink/ncWqA9Zu3YEofqri
>
> Also fails on gcc trunk.
>
> Possibly related to bug 81338 "stringstream remains empty after being
> moved into multiple times"? Although I see that one is fixed by Mr
> Wakely.

That only affected the new ABI, and was present in all modes, not just C++17.
Re: Byte swapping support
On Tue, 2017-09-12 at 21:46 +0200, Eric Botcazou wrote:
> > In contrast to the existing scalar storage order support for structs, the
> > goal is to reverse the storage order for all memory operations to achieve
> > maximum compatibility with the behavior on big-endian systems, as far as
> > observable by the application.
>
> I presume that you're well aware of this, but you cannot just reverse the
> storage order for any memory operation; for example, an array of 4 chars in C
> is stored the same way in big-endian and little-endian order, so you ought not
> to do byte swapping when you access it as a whole. So the above sentence must
> be read as "to reverse the storage order for all scalar memory operations".

Yes, I'm aware of this. Thanks for stating this more clearly.

> When the scalar_storage_order attribute was designed, discussions led to the
> conclusion that doing the swapping for any scalar memory operation, as opposed
> to any access to a scalar within a structure, would not be a significant enough
> step forward to warrant the significantly more complex implementation (or the
> big performance penalty if you do things very roughly).

Was this considered significantly more complex because of the need to
discriminate between native and reverse order? Or do you expect similar
complexity even if this is not required (see my comment below)?

> > The plan is to insert byte swapping instructions as part of the RTL
> > expansion of GIMPLE assignments that access memory. This would leverage
> > code that was added for -fsso-struct, keeping the code simple and
> > maintainable.
>
> How do you discriminate scalars stored in native order and scalars stored in
> reverse order though? That's the main difficulty of the implementation.

I don't. The idea is to reverse scalar storage order for the whole
userspace process and then add byte swapping to the Linux kernel when
accessing userspace memory. This keeps userspace memory consistent with
regards to endianness, which should lead to high compatibility with
big-endian applications. Userspace memory access from the kernel always
uses a small set of helper functions, which should make it easier to
insert byte swapping at appropriate places.

Jürg
Re: Byte swapping support
On Tue, 2017-09-12 at 08:52 -0700, H.J. Lu wrote:
> Can you use __attribute__ ((scalar_storage_order)) in GCC 7?

To support existing large code bases, the goal is to reverse storage
order for all scalars, not just (selected) structs/unions. We also need
to support taking the address of a scalar field, for example. C++
support will be required as well.

Jürg
Re: Byte swapping support
> On Sep 13, 2017, at 5:51 AM, Jürg Billeter wrote:
>
> On Tue, 2017-09-12 at 08:52 -0700, H.J. Lu wrote:
>> Can you use __attribute__ ((scalar_storage_order)) in GCC 7?
>
> To support existing large code bases, the goal is to reverse storage
> order for all scalars, not just (selected) structs/unions. Also need to
> support taking the address of a scalar field, for example. C++ support
> will be required as well.

I wonder about that. It's inefficient to do byte swapping on local data;
it is only useful and needed on external data: data that goes to files
for use by other-byte-order applications, or data that goes across a bus
or network to consumers that have the other byte order. A byte-swapped
local variable only consumes cycles and instruction space to no purpose.

My experience is that byte order marking of data is very useful, but it
is always applied selectively, just to those spots where it is needed.

paul
Re: How to configure a bi-arch PowerPC GCC?
On Jul 20 2017, Sebastian Huber wrote:
> Ok, so why do I get a "error: unrecognizable insn:"? How can I debug a
> message like this:
>
> (insn 12 11 13 2 (set (reg:CCFP 126)
>         (compare:CCFP (reg:TF 123)
>             (reg:TF 124))) "test-v0.i":5 -1
>      (nil))

This is supposed to be matched by the cmptf_internal1 pattern with
-mabi=ibmlongdouble. Looks like your configuration defaults to
-mabi=ieeelongdouble.

Andreas.

-- 
Andreas Schwab, SUSE Labs, sch...@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE 1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."
Re: Byte swapping support
> Was this considered significantly more complex because of the need to
> discriminate between native and reverse order? Or do you expect similar
> complexity even if this is not required (see my comment below)?

The former.

> I don't. The idea is to reverse scalar storage order for the whole
> userspace process and then add byte swapping to the Linux kernel when
> accessing userspace memory. This keeps userspace memory consistent
> with regards to endianness, which should lead to high compatibility
> with big-endian applications. Userspace memory access from the kernel
> always uses a small set of helper functions, which should make it
> easier to insert byte swapping at appropriate places.

Well, if your userspace is entirely in reverse order, then of course
things are totally different, and I suspect that you'll pay the price in
terms of run-time performance. This is not what the attribute was
designed for, although we added the -fsso-struct switch at some point.

-- 
Eric Botcazou
Re: RFC: Improving GCC8 default option settings
On Wed, Sep 13, 2017 at 3:21 AM, Michael Clark wrote:
>
>> On 13 Sep 2017, at 1:15 PM, Michael Clark wrote:
>>
>> - https://rv8.io/bench#optimisation
>> - https://rv8.io/bench#executable-file-sizes
>>
>> -O2 is 98% perf of -O3 on x86-64
>> -Os is 81% perf of -O3 on x86-64
>>
>> -O2 saves 5% space on -O3 on x86-64
>> -Os saves 8% space on -Os on x86-64
>>
>> 17% drop in performance for 3% saving in space is not a good trade for a
>> “general” size optimisation. It’s more like executable compression.
>
> Sorry, fixed typo:
>
> -O2 is 98% perf of -O3 on x86-64
> -Os is 81% perf of -O3 on x86-64
>
> -O2 saves 5% space on -O3 on x86-64
> -Os saves 8% space on -O3 on x86-64
>
> The extra ~3% space saving for ~17% drop in performance doesn’t seem like a
> good general option for size based on the cost in performance.
>
> Again, I really like GCC’s -O2 and hope that its binaries don’t grow in
> size nor slow down.

I think with GCC -Os and -O2 are essentially the same, with the difference
that -Os assumes regions are cold and thus to be optimized for size, while
-O2 assumes they are hot and thus to be optimized for speed, in cases where
there is no heuristic proving otherwise. I know this doesn't 100% reflect
implementation reality, but it should be close.

IMHO we should turn on the flags we turn on with -fprofile-use and have
some more nuances in optimize_*_for_{speed,size}, as we now track profile
quality more closely.

I see -O1 as mostly worthless unless you are compiling machine-generated
code that makes -O2+ go OOM or time out. Apart from avoiding quadratic or
worse algorithms, -O1 sees no love.

On its own -O3 doesn't add much (some loop opts and slightly more
aggressive inlining/unrolling), so whatever it does we should consider
doing at -O2 eventually.

Richard.
Re: RFC: Improving GCC8 default option settings
On Wed, Sep 13, 2017 at 9:43 AM, Janne Blomqvist wrote:
> On Tue, Sep 12, 2017 at 4:57 PM, Wilco Dijkstra wrote:
>> These are just a few ideas to start. What do people think? I'd welcome
>> discussion and other proposals for similar improvements.
>
> What about the default behavior if no options are given? I think a
> more reasonable default would be something roughly like
>
> -O2 -Wall
>
> or if debuggability is considered more important than speed & size, maybe
>
> -Og -g -Wall

Enabling (some) warnings by default seems reasonable to me. Not sure about
the rest, though. This is something people can't seem to agree on. Some
like warnings by default, some like optimizations by default. Some are
against warnings by default, arguing that people like distro-builders have
no need for warnings (for example). Some are against optimizations by
default, because optimization would make compilation slower when they just
want to check whether some piece of code compiles successfully (for
example).

I think the only way to decide what options to enable by default is to
first decide who your target audience is going to be. Who do you expect to
run gcc with no options specified? If you ask me, that will mostly be
students or beginners. Those who are experienced are more likely to use an
automated build system where they specify build options only once and then
forget about them. An inexperienced person would need warnings and debug
info by default, and maybe a few simple optimizations that do not
interfere with debugging.

There can only be one set of default options, and which one you pick will
depend on who your target audience is. You cannot please everyone.

-- 
Kevin
Re: Byte swapping support
On Wed, 2017-09-13 at 13:08 +, paul.kon...@dell.com wrote:
> > On Sep 13, 2017, at 5:51 AM, Jürg Billeter wrote:
> >
> > To support existing large code bases, the goal is to reverse storage
> > order for all scalars, not just (selected) structs/unions. Also need to
> > support taking the address of a scalar field, for example. C++ support
> > will be required as well.
>
> I wonder about that. It's inefficient to do byte swapping on local
> data; it is only useful and needed on external data. Data that goes
> to files for use by other-byte-order applications, or data that goes
> across a bus or network to consumers that have the other byte
> order. A byte swapped local variable only consumes cycles and
> instruction space to no purpose.

Existing code that has been written under the assumption of big-endian
memory layout can break on little-endian systems even with just local
data (e.g., accessing individual bytes of a 32-bit scalar). It would
obviously be best to fix all such code; however, that's not always
feasible, unfortunately. Avoiding byte swapping for spilled registers,
where storage order is (normally) not observable by applications, will
hopefully reduce the performance overhead.

Jürg
Re: RFC: Improving GCC8 default option settings
On Wed, Sep 13, 2017 at 03:41:19PM +0200, Richard Biener wrote:
> On its own -O3 doesn't add much (some loop opts and slightly more
> aggressive inlining/unrolling), so whatever it does we
> should consider doing at -O2 eventually.

Well, -O3 adds vectorization, which we don't enable at -O2 by default.

Jakub
Re: RFC: Improving GCC8 default option settings
On Wed, Sep 13, 2017 at 3:46 PM, Jakub Jelinek wrote:
> On Wed, Sep 13, 2017 at 03:41:19PM +0200, Richard Biener wrote:
>> On its own -O3 doesn't add much (some loop opts and slightly more
>> aggressive inlining/unrolling), so whatever it does we
>> should consider doing at -O2 eventually.
>
> Well, -O3 adds vectorization, which we don't enable at -O2 by default.

As said, -fprofile-use enables it, so -O2 should eventually do the same
for "really hot code".

Richard.

> Jakub
Re: RFC: Improving GCC8 default option settings
On Dienstag, 12. September 2017 23:27:22 CEST Michael Clark wrote:
> > On 13 Sep 2017, at 1:57 AM, Wilco Dijkstra wrote:
> >
> > Hi all,
> >
> > At the GNU Cauldron I was inspired by several interesting talks about
> > improving GCC in various ways. While GCC has many great optimizations, a
> > common theme is that its default settings are rather conservative. As a
> > result users are required to enable several additional optimizations by
> > hand to get good code. Other compilers enable more optimizations at -O2
> > (loop unrolling in LLVM was mentioned repeatedly) which GCC could/should
> > do as well.
>
> There are some nuances to -O2. Please consider -O2 users who wish to use
> it like Clang/LLVM’s -Os (-O2 without loop vectorisation IIRC).
>
> Clang/LLVM has an -Os that is like -O2, so adding optimisations that
> increase code size can be skipped from -Os without drastically affecting
> performance.
>
> This is not the case with GCC, where -Os is a size-at-all-costs
> optimisation mode. The GCC users' option for size, but not at the expense
> of speed, is -O2.
>
> Clang     GCC
> -Oz   ~=  -Os
> -Os   ~=  -O2

No. Clang's -Os is somewhat limited compared to gcc's, just like the clang
-Og is just -O1. AFAIK -Oz is a proprietary Apple clang parameter, and not
in clang proper.

'Allan
Re: RFC: Improving GCC8 default option settings
On Mittwoch, 13. September 2017 15:46:09 CEST Jakub Jelinek wrote:
> On Wed, Sep 13, 2017 at 03:41:19PM +0200, Richard Biener wrote:
> > On its own -O3 doesn't add much (some loop opts and slightly more
> > aggressive inlining/unrolling), so whatever it does we
> > should consider doing at -O2 eventually.
>
> Well, -O3 adds vectorization, which we don't enable at -O2 by default.

Would it be possible to enable basic block vectorization at -O2? I assume
that doesn't increase binary size, since it doesn't unroll loops.

'Allan
Re: RFC: Improving GCC8 default option settings
> On Wed, Sep 13, 2017 at 3:46 PM, Jakub Jelinek wrote:
> > On Wed, Sep 13, 2017 at 03:41:19PM +0200, Richard Biener wrote:
> >> On its own -O3 doesn't add much (some loop opts and slightly more
> >> aggressive inlining/unrolling), so whatever it does we
> >> should consider doing at -O2 eventually.
> >
> > Well, -O3 adds vectorization, which we don't enable at -O2 by default.
>
> As said, -fprofile-use enables it so -O2 should eventually do the same
> for "really hot code".

I don't see static profile prediction as very useful here for finding
"really hot code" - neither in the current implementation nor in the
future. The problem with -O2 is that we kind of know that only 10% of the
code somewhere matters for performance, but we have no way to reliably
identify it.

It would make sense to have less aggressive vectorization at -O2 and more
at -Ofast/-O3.

Adding -Os and -Oz would make sense to me - even with hot/cold info it is
not desirable to optimize as aggressively for size as we do, because
mistakes happen and one does not want to make code paths 1000 times slower
to save one byte of binary. We could handle this gracefully internally by
having logic for "known to be cold" and "guessed to be cold". The new
profile code can make a difference here.

Honza

> Richard.
>
> > Jakub
Re: RFC: Improving GCC8 default option settings
> On Wed, Sep 13, 2017 at 3:21 AM, Michael Clark wrote:
> >
> >> On 13 Sep 2017, at 1:15 PM, Michael Clark wrote:
> >>
> >> - https://rv8.io/bench#optimisation
> >> - https://rv8.io/bench#executable-file-sizes
> >>
> >> -O2 is 98% perf of -O3 on x86-64
> >> -Os is 81% perf of -O3 on x86-64
> >>
> >> -O2 saves 5% space on -O3 on x86-64
> >> -Os saves 8% space on -Os on x86-64
> >>
> >> 17% drop in performance for 3% saving in space is not a good trade for a
> >> “general” size optimisation. It’s more like executable compression.
> >
> > Sorry fixed typo:
> >
> > -O2 is 98% perf of -O3 on x86-64
> > -Os is 81% perf of -O3 on x86-64
> >
> > -O2 saves 5% space on -O3 on x86-64
> > -Os saves 8% space on -O3 on x86-64

I am a bit surprised you see only an 8% code size difference between -Os
and -O3. I look into these numbers occasionally and it is usually well
into two-digit territory.

http://hubicka.blogspot.cz/2014/04/linktime-optimization-in-gcc-2-firefox.html
http://hubicka.blogspot.cz/2014/09/linktime-optimization-in-gcc-part-3.html

has a 42% code segment size reduction for Firefox and 19% for libreoffice.

Honza
Re: RFC: Improving GCC8 default option settings
On 12/09/17 16:57, Wilco Dijkstra wrote:
> [...] As a result users are required to enable several additional
> optimizations by hand to get good code. Other compilers enable more
> optimizations at -O2 (loop unrolling in LLVM was mentioned repeatedly)
> which GCC could/should do as well.
> [...] I'd welcome discussion and other proposals for similar improvements.

What's the status of graphite? It's been around for years. Isn't it mature
enough to enable these:

-floop-interchange -ftree-loop-distribution -floop-strip-mine -floop-block

by default for -O2? (And I'm not even sure those are the complete set of
graphite optimization flags, or just the "useful" ones.)
Re: RFC: Improving GCC8 default option settings
On September 13, 2017 5:35:11 PM GMT+02:00, Jan Hubicka wrote:
>> On Wed, Sep 13, 2017 at 3:46 PM, Jakub Jelinek wrote:
>> > On Wed, Sep 13, 2017 at 03:41:19PM +0200, Richard Biener wrote:
>> >> On its own -O3 doesn't add much (some loop opts and slightly more
>> >> aggressive inlining/unrolling), so whatever it does we
>> >> should consider doing at -O2 eventually.
>> >
>> > Well, -O3 adds vectorization, which we don't enable at -O2 by default.
>>
>> As said, -fprofile-use enables it so -O2 should eventually do the same
>> for "really hot code".
>
> I don't see static profile prediction to be very useful here to find
> "really hot code" - neither in current implementation or future. The
> problem of -O2 is that we kind of know that only 10% of code somewhere
> matters for performance but we have no way to reliably identify it.

It's hard to do better than statically look at (ipa) loop depth. But
shouldn't that be good enough?

> It would make sense to have less aggressive vectorization at -O2 and
> more at -Ofast/-O3.

We tried that but the runtime effects were not offsetting the compile
time cost.

> Adding -Os and -Oz would make sense to me - even with hot/cold info it
> is not desirable to optimize as aggressively for size as we do, because
> mistakes happen and one does not want to make code paths 1000 times
> slower to save one byte of binary.
>
> We could handle this gracefully internally by having logic for "known
> to be cold" and "guessed to be cold". The new profile code can make a
> difference here.
>
> Honza
>>
>> Richard.
>>
>> > Jakub
Re: RFC: Improving GCC8 default option settings
> > I don't see static profile prediction to be very useful here to find
> > "really hot code" - neither in current implementation or future. The
> > problem of -O2 is that we kind of know that only 10% of code somewhere
> > matters for performance but we have no way to reliably identify it.
>
> It's hard to do better than statically look at (ipa) loop depth. But
> shouldn't that be good enough?

Only if you assume that you have the whole program and understand indirect
calls. There are some stats on this here:

http://ieeexplore.ieee.org/document/717399/

It shows that propagating a static profile across the whole program (which
is just a tiny bit fancier than counting loop depth) sort of works
statistically. I really do not have very high hopes of this reliably
working in a production compiler. We already have PRs for single-function
benchmarks where a deep loop nest is used in initialization or so, and the
actual hard-working part has a small loop nest & gets identified as cold.
As soon as you start propagating in a whole-program context, such local
mistakes will become more common.

> > It would make sense to have less aggressive vectorization at -O2 and
> > more at -Ofast/-O3.
>
> We tried that but the runtime effects were not offsetting the compile
> time cost.

Yep, I remember that.

Honza
Re: RFC: Improving GCC8 default option settings
On September 13, 2017 6:24:21 PM GMT+02:00, Jan Hubicka wrote:
>> > I don't see static profile prediction to be very useful here to find
>> > "really hot code" - neither in current implementation or future. The
>> > problem of -O2 is that we kind of know that only 10% of code somewhere
>> > matters for performance but we have no way to reliably identify it.
>>
>> It's hard to do better than statically look at (ipa) loop depth. But
>> shouldn't that be good enough?
>
> Only if you assume that you have the whole program and understand
> indirect calls. There are some stats on this here:
> http://ieeexplore.ieee.org/document/717399/
>
> It shows that propagating a static profile across the whole program
> (which is just a tiny bit fancier than counting loop depth) sort of
> works statistically. I really do not have very high hopes of this
> reliably working in a production compiler. We already have PRs for
> single-function benchmarks where a deep loop nest is used in
> initialization or so, and the actual hard-working part has a small loop
> nest & gets identified as cold.
>
> As soon as you start propagating in a whole-program context, such local
> mistakes will become more common.

Heh, I would just make loop nests hot without globally making anything
cold because of that. Basically sth like optimistic ipa profile
propagation.

Richard.

>> > It would make sense to have less aggressive vectorization at -O2 and
>> > more at -Ofast/-O3.
>>
>> We tried that but the runtime effects were not offsetting the compile
>> time cost.
>
> Yep, I remember that.
>
> Honza
Re: Byte swapping support
On 09/12/2017 02:32 AM, Jürg Billeter wrote:
> To support applications that assume big-endian memory layout on little-
> endian systems, I'm considering adding support for reversing the storage
> order to GCC. In contrast to the existing scalar storage order support
> for structs, the goal is to reverse the storage order for all memory
> operations to achieve maximum compatibility with the behavior on
> big-endian systems, as far as observable by the application.

Intel has support for this in icc. It took about 5 years for a small team
to make it work on a very large application. That includes both the
compiler development and application development time. There are a lot of
complicated issues that need to be solved to make this work on real code,
both in the compiler and in the application code. There is a Dr Dobbs
article about some of it; search for "Writing a Bi-Endian Compiler" if you
are interested.

Even though they got it working, it was painful to use. Icc goes to a lot
of trouble to optimize away unnecessary byte-swapping to improve
performance, but that meant any variable could be big or little endian
regardless of how it was declared, could be different endianness at
different places in the code, and could even be both endiannesses (stored
in two locations) at the same time if the code needed both. Sometimes we'd
find a bug, and it would take a week to figure out whether it was a
compiler bug or an application bug.

> To facilitate byte swapping at endian boundaries (kernel or libraries),
> I'm also considering developing a new GCC builtin that can byte-swap
> whole structs in memory. There are limitations to this, e.g., unions
> could not be supported in general. However, I still expect this to be
> very useful.

There is a lot more stuff that will cause problems. Byte-swapping FP
doesn't make sense. You can only byte swap a variable if you know its
type, but you don't know the type of a va_list ap argument, so you can't
call a big-endian vprintf from little-endian code and vice versa. If you
have a template expanded in both big- and little-endian code, you will run
into problems unless name mangling changes to include endian info, which
means you lose ABI compatibility with the current name mangling scheme.

There will also be trouble with variables in shared libraries that get
initialized by the dynamic linker. You will either have to add a new set
of other-endian relocations, or else you will have to add code to
byte-swap data after relocations are performed, probably via an init
routine, which will have to run before the other init routines. There is
the same issue with static linking, but that one is a little easier to
handle, as you can use a post-linking pass to edit the binary and byte
swap stuff that needs to be byte swapped after relocations are performed.

To handle endian boundaries, you will need to force all declarations to
have an endianness, you will need to convert when calling a big-endian
function from a little-endian function and vice versa, and you will need
to give an error if you see something you can't convert, like a va_list
argument. Besides the issue of the C library not changing endianness, you
will likely also have third-party libraries that you can't change the
endianness of, and that need to be linked into your application.

Before you start, you should give some thought to how debugging will work.
DWARF does have an endianity attribute; you will need to set it correctly,
or debugging will be hopeless. Even if you set it correctly, if you have
optimizations to remove unnecessary byte swapping, debugging optimized
code will still be hard, and people using the compiler will have to be
trained on how to deal with endianness issues.

And there are lots of other problems; I don't have time to document them
all, or even remember them all. Personally, I think you are better off
trying to fix the application to make it more portable. Fixing the
compiler is not a magic solution to the problem that is any easier than
fixing the application.

Jim
Re: Byte swapping support
On Wed, 13 Sep 2017, Jim Wilson wrote: > Intel has support for this in icc. It took about 5 years for a small team to And allegedly has patents in this area. -- Joseph S. Myers jos...@codesourcery.com
gcc-6-20170913 is now available
Snapshot gcc-6-20170913 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/6-20170913/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 6 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-6-branch revision 252740

You'll find:

 gcc-6-20170913.tar.xz                Complete GCC

  SHA256=03edcdafdbb456ef1d12b466623b4a308ff7f41385f249b5ff7444dc2f838d3a
  SHA1=9cddeab7feb967f2f402f7f3c8b3a67a27e9a588

Diffs from 6-20170906 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-6
link is updated and a message is sent to the gcc list. Please do not use
a snapshot before it has been announced that way.
Re: Byte swapping support
Jürg Billeter writes:
> I don't. The idea is to reverse scalar storage order for the whole
> userspace process and then add byte swapping to the Linux kernel when
> accessing userspace memory. This keeps userspace memory consistent
> with regards to endianness, which should lead to high compatibility
> with big-endian applications. Userspace memory access from the kernel
> always uses a small set of helper functions, which should make it
> easier to insert byte swapping at appropriate places.

I expect you'll find that it isn't that easy. There are a lot of opaque
copies that copy whole structures. You'll need a whole compat layer,
similar to the 32<->64-bit compat layers, but actually doing more work,
because it has to handle all fields larger than one byte, not just fields
which differ in size.

-Andi
Re: Bit-field struct member sign extension pattern results in redundant
> On 5 Sep 2017, at 9:35 AM, Michael Clark wrote:
>
>> On 19 Aug 2017, at 4:10 AM, Richard Henderson wrote:
>>
>> On 08/17/2017 03:29 PM, Michael Clark wrote:
>>> hand coded x86 asm (no worse because the sar depends on the lea)
>>>
>>> sx5(int):
>>>   shl edi, 27
>>>   sar edi, 27
>>>   movsx eax, dl
>>
>> Typo in the register, but I know what you mean. More interestingly, edi
>> already has the sign-extended value, so "mov eax, edi" suffices (saving
>> one byte or perhaps allowing better register allocation).
>>
>> That said, if anyone is tweaking x86-64 patterns on this, consider
>>
>> sx5(int):
>>   rorx eax, edi, 5
>>   sar  eax, 27
>>
>> where the (newish) bmi2 rorx instruction allows the data to be moved
>> into place while rotating, avoiding the extra move entirely.
>
> The patterns might be hard to match unless we can get the gimple
> expansions for bitfield access to use the mode of the enclosing type.
>
> For bitfield accesses to SI mode fields under 8 bits on RISC-V, gcc is
> generating two shifts on QI mode sub-registers, each with a sign-extend.
>
> For bitfield accesses to SI mode fields under 16 bits on RISC-V, gcc is
> generating two shifts on HI mode sub-registers, each with a sign-extend.
>
> The bitfield expansion logic is selecting the narrowest type that can
> hold the bitwidth of the field, versus the bitwidth of the enclosing
> type, and this appears to be in the gimple to RTL transform, possibly in
> expr.c.
>
> Using the mode of the enclosing type for bitfield member access
> expansion would likely solve this problem. I suspect that with the use
> of caches and word-sized register accesses, this sort of change would be
> reasonable to make for 32-bit and 64-bit targets.
>
> I also noticed something I didn't spot earlier. On RV32 the sign extend
> and shifts are correctly coalesced into 27-bit shifts in the combine
> stage. We are presently looking into another redundant sign extension
> issue on RV64 that could potentially be related. It could be that the
> shift coalescing optimisation doesn't happen unless the redundant sign
> extensions are eliminated early in combine by simplify_rtx, i.e. the
> pattern is more complex due to sign_extend ops that are not eliminated.
>
> - https://cx.rv8.io/g/2FxpNw
>
> RV64 and AArch64 both appear to have the issue, but with different
> expansions for the shift and extend pattern due to the mode of access
> (QI or HI). Field accesses above 16 bits create SI mode accesses and
> generate the expected code. The RISC-V compiler has the
> #define SLOW_BYTE_ACCESS 1 patch, however it appears to make no
> difference in this case. SLOW_BYTE_ACCESS suppresses QI mode and HI mode
> loads in some bitfield test cases when a struct is passed by pointer,
> but has no effect on this particular issue. This shows the codegen for
> the fix to the SLOW_BYTE_ACCESS issue, i.e. proof that the compiler has
> SLOW_BYTE_ACCESS defined to 1.
>
> - https://cx.rv8.io/g/TyXnoG
>
> A target-independent fix that would solve the issue on ARM and RISC-V
> would be to access bitfield members with the mode of the bitfield
> member's enclosing type instead of the smallest mode that can hold the
> bitwidth of the type. If we had a char or short member in the struct, I
> can understand the use of QI and HI mode, as we would need narrower
> accesses due to alignment issues, but in this case the member is an int
> and one would expect this to expand to SI mode accesses if the enclosing
> type is SI mode.

It appears that on 64-bit targets, promotion of the shift operand to an SI
mode subregister of a DI mode register prevents combine from coalescing
the shift and sign_extend. The only difference I can spot between RV32,
which coalesces the shift during combine, and RV64 and other 64-bit
targets, which do not, is that the 64-bit targets have promoted the shift
operand to an SI mode subregister of a DI mode register, whereas on RV32
the shift operand is simply an SI mode register. I suspect there is some
code in simplify-rtx.c that is missing the subregister case; however, I'm
still trying to isolate the code that coalesces shift and sign_extend.

I've found a similar, perhaps smaller but related case where a shift is
not coalesced; however, in this case it is also not coalesced on RV32:

https://cx.rv8.io/g/fPdk2F

> riscv64:
>
> sx5(int):
>   slliw a0,a0,3
>   slliw a0,a0,24
>   sraiw a0,a0,24
>   sraiw a0,a0,3
>   ret
>
> sx17(int):
>   slliw a0,a0,15
>   sraiw a0,a0,15
>   ret
>
> riscv32:
>
> sx5(int):
>   slli a0,a0,27
>   srai a0,a0,27
>   ret
>
> sx17(int):
>   slli a0,a0,15
>   srai a0,a0,15
>   ret
>
> aarch64:
>
> sx5(int):
>   sbfiz w0, w0, 3, 5
>   asr w0, w0, 3
>   ret
>
> sx17(int):
>   sbfx w0, w0, 0, 17
>   ret
Re: Byte swapping support
> And there are lots of other problems, I don't have time to document them
> all, or even remember them all. Personally, I think you are better off
> trying to fix the application to make it more portable. Fixing the
> compiler is not a magic solution to the problem that is any easier than
> fixing the application.

Note that WRS' Diab compiler has got something equivalent to what GCC has
got now, i.e. a way to tag a particular component in a structure as BE or
LE.

-- 
Eric Botcazou