Re: LTO remapping/deduction of machine modes of types/decls
On Tue, 10 Jan 2017, Alexander Monakov wrote:
> On Tue, 10 Jan 2017, Richard Biener wrote:
> > In general I think they should match.  But without seeing concrete
> > examples of where they do not I can't comment on whether such exceptions
> > make sense.  For example, if you adjust a DECL's alignment and then
> > re-layout it, I'd expect you might get a non-BLKmode mode for an
> > aggregate in some circumstances -- but then decl and type are not 1:1
> > compatible (due to different alignment), but this case is clearly desired
> > as requiring type copies for the sake of alignment would be wasteful.
>
> Thanks; Vlad will follow up with (I believe) a different kind of mismatch
> originating in the C++ front-end.
>
> > > For our original goal, I think we'll switch to the other solution I've
> > > outlined in the opening mail, i.e. propagating mode tables at WPA stage
> > > and keeping enough information to know if the section comes from the
> > > host or native compiler.
> >
> > So add a hack on top of the hack?  Ugh.  So why exactly doesn't it
> > already work?  It looks like decls and types have their modes
> > "fixed" with the per-file mode table at WPA time.  So what is missing
> > is to "fix" modes in the per-function sections that are not touched
> > by WPA?
>
> WPA re-streams packed function bodies as-is, so anything referred to
> from within just the body won't be subject to mode remapping; I think
> only modes of toplevel declarations and functions' arguments will be
> remapped.  And I believe it wouldn't be acceptable to unpack/remap/repack
> function bodies at WPA stage (it's contrary to the LTO scalability goal).

Yes indeed.  But this means the mode-maps have to be per function
section (with possibly a way to "share" them?).  Or we need a way
to annotate function sections with "no need to re-map", as the
native nvptx sections don't need remapping and the others all use
the same map?

Richard.
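To make the alignment example above concrete, a minimal C sketch
(target-dependent behavior; the BLKmode/DImode outcome assumes a 64-bit
strict-alignment target, and is only what one might expect, per the
caveats above):

  /* The type has alignment 1, so an 8-byte aggregate is laid out in
     BLKmode; a decl of that type re-laid-out with 8-byte alignment may
     be given the scalar DImode, so the decl's and the type's modes no
     longer match.  */
  struct bytes { char c[8]; };                   /* type: BLKmode      */
  struct bytes b __attribute__ ((aligned (8)));  /* decl: maybe DImode */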
Re: LTO remapping/deduction of machine modes of types/decls
On Wed, 11 Jan 2017, Richard Biener wrote:
> > WPA re-streams packed function bodies as-is, so anything referred to
> > from within just the body won't be subject to mode remapping; I think
> > only modes of toplevel declarations and functions' arguments will be
> > remapped.  And I believe it wouldn't be acceptable to unpack/remap/repack
> > function bodies at WPA stage (it's contrary to the LTO scalability goal).
>
> Yes indeed.  But this means the mode-maps have to be per function
> section (with possibly a way to "share" them?).  Or we need a way
> to annotate function sections with "no need to re-map" as the
> native nvptx sections don't need remapping and the others all use
> the same map?

Right, the latter: we know that sections coming from the native compiler
already have the right modes and thus need no remapping, and the sections
coming from the host compiler all need remapping (and will use the same
mapping).  Prefixes of per-function section names already carry the
distinction (".gnu.lto_foo" vs. ".gnu.offload_lto_foo").

Alexander
k-byte memset/memcpy/strlen builtins
Hi,

When examining the performance of some test cases on s390 I realized
that we could do better for constructs like 2-byte memcpys or
2-byte/4-byte memsets.  Due to some s390-specific architectural
properties, we could be faster by e.g. avoiding excessive unrolling and
using dedicated memory instructions (or similar).

For 1-byte memset/memcpy the builtin functions provide a straightforward
way to achieve this.  At first sight it seemed possible to extend
tree-loop-distribution.c to include the additional variants we need.
However, multibyte memsets/memcpys are not covered by the C standard and
I'm therefore unsure whether such an approach is preferable or whether
there are more idiomatic ways or places to add the functionality.

The same question goes for 2-byte strlen.  I didn't see a recognition
pattern for strlen (apart from optimizations due to known string length
in tree-ssa-strlen.c).  Would it make sense to include strlen recognition
and subsequently handling for 2-byte strlen?  The situation might of
course be more complicated than memset because of encodings etc.  My
snippet in question used a fixed-length encoding of 2 bytes, however.

Another simple idea to tackle this would be a peephole optimization, but
I'm not sure that is really feasible for something like memset.
Wouldn't the peephole have to be recursive then?

Regards
 Robin
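For concreteness, a minimal C sketch of the two loop shapes in question:
a 2-byte memset and a strlen over a fixed-width 2-byte encoding.  The
helper names are hypothetical; this is only meant to pin down the
constructs a recognizer would have to match.

  #include <stddef.h>
  #include <stdint.h>

  /* 2-byte memset: fill n 16-bit elements with value v.  A recognizer
     could replace this loop with one wide "setmem"-style operation.  */
  void memset16 (uint16_t *p, uint16_t v, size_t n)
  {
    for (size_t i = 0; i < n; i++)
      p[i] = v;
  }

  /* 2-byte strlen over a fixed-width encoding: count elements up to
     the first zero element.  */
  size_t strlen16 (const uint16_t *s)
  {
    size_t n = 0;
    while (s[n] != 0)
      n++;
    return n;
  }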
Re: Help with integrating my test program into dejagnu
On Jan 10, 2017, at 9:13 PM, Daniel Santos wrote:
> I've gotten rid of the Makefile and everything is run now from msabi.exp.
> I've also gotten rid of the header file, now that I know how to define a
> "_noinfo" fn pointer, so it's down to just 4 files: msabi.exp, gen.cc,
> msabi.c and do_test.S.

Sounds better.

> After running using DG_TORTURE_OPTIONS,

But why?  I think you missed what you're testing.  You aren't creating or
looking for bugs in the optimizer.  Your test case isn't for an optimizer;
therefore, you should not torture the poor test case.  I think what you
are testing is argument passing.  That typically is the decision about
what bits are where, and that is an optimization-irrelevant thing to test.

> it became clear that the resulting program was just too big, so I've
> modified the generator so that the test can be done in little bits.

A sum of little bits is likely always more costly than just one large
thing.  I don't think there is an economy to be had there, other than the
ability to say test case 15 fails, and you want a person to be able to
drill into test case 15 by itself without the others around.  With a
well-structured large test case, it should be clear how each subpart can
be separated out and run as a single small test case.  For example:

  test1() { ... }

  main() {
    test1();
    test2();
    [ ... ]
  }

Here, we see that we can remove 99% of the test case and run just a
single case: a normal, manual edit leaving just one line, and then the
transitive closure of the one test routine.  I think if you time it,
you'll discover that you can fit in more cases this way than if you break
them up; also, if you torture, you can't fit in as many cases in the time
given.  This is at the heart of why I don't think you want to torture.

> Otherwise, the build eats 6GiB+ and takes forever on the final set of
> flags.

So, one review point will be: is the added testing time at all useful in
general?  That's an open review point.  The compiler can easily be
damaged with random edits, but we have fairly good coverage that will
catch most of it.  We typically don't spend time in the test suite
methodically catching every single thing that can go wrong, just the
things that usually do go wrong based upon reported bugs.  What is the
added time in seconds to test, and on what type of machine?

> And now for 50 questions. :)  Am I using DG_TORTURE_OPTIONS correctly

I want to say no.  See above.  No one should ever use it unless they have
a very specific, well-thought-out reason.  I've not heard the reason in
this case.

> or should such a test only exist under gcc.torture?

gcc.torture is for a very narrow and specific type of bug.  There are
test cases that people who work on the optimizer add, which go through
the optimizer, to ensure that the bug they just fixed doesn't reappear.
So, the first question: are you working on the optimizer?  If not, then
it would likely be inappropriate.

> I'm not sure if I'm managing this correctly, as I'm calling pass/fail
> $subdir after each iteration of the test (should this only be called
> once?).

No; if you did it, you would call it once per iteration, and you would
mix the torture flags into the pass/fail line.

  pass "$file.c $torture_option"

would be typical; in your code, it would be $generator_args.

> Also, being that the generator is C++, I've added HOSTCXX and
> HOSTCXXFLAGS to site.exp, I hope that's OK.

Hum.  I worry about a knock-on effect of some sort.  Generally I don't
like adding anything to site.exp unless needed.  In this case, I think
it'd be fine.
It is the simplest and most direct way to do it.

> Finally, would you please look at my runtest_msabi procedure to make
> sure that I'm doing the build correctly?  I'm using "remote_exec build"
> for most of it and I'm not 100% certain if that is the correct way to
> do it.

Yeah, close enough to likely not worry about it too much.  If you wanted
to improve it, the next step would be to remove the isnative part and
finish the code for cross builds, and just after that finish the code for
canadian cross builds.  A canadian cross is one in which the build machine
and the host machine are different.  With the isnative, you can get the
details of host/build and target machine completely wrong and pay no
price for it.  Once you remove it, you then have to understand which code
works for which system and ensure it works.  A cross build, loosely, is
one in which the target machine and the host machine are different.  The
reason why I suggested isnative is that then you don't have to worry
about it, and you can punt the finishing to a cross or canadian-cross
person.  For them, it is rather trivial to clean up the test case to get
it to work in a cross environment.  Without testing, it is easy enough to
get wrong.  Also, for them, testing it is then trivial.  If you can find
someone that can test in a cross environment and report back if it works
Re: k-byte memset/memcpy/strlen builtins
On January 11, 2017 5:16:43 PM GMT+01:00, Robin Dapp wrote:
> Hi,
>
> When examining the performance of some test cases on s390 I realized
> that we could do better for constructs like 2-byte memcpys or
> 2-byte/4-byte memsets.  Due to some s390-specific architectural
> properties, we could be faster by e.g. avoiding excessive unrolling and
> using dedicated memory instructions (or similar).

Not sure why you mention memcpy; how does that depend on 'element size'?

> For 1-byte memset/memcpy the builtin functions provide a
> straightforward way to achieve this.  At first sight it seemed possible
> to extend tree-loop-distribution.c to include the additional variants
> we need.  However, multibyte memsets/memcpys are not covered by the C
> standard and I'm therefore unsure whether such an approach is
> preferable or whether there are more idiomatic ways or places to add
> the functionality.

Yes, for memset with a larger element we could add an optab plus internal
function combination and use that when the target wants it.  Or always
use such an IFN and fall back to loopy expansion.

> The same question goes for 2-byte strlen.  I didn't see a recognition
> pattern for strlen (apart from optimizations due to known string length
> in tree-ssa-strlen.c).  Would it make sense to include strlen
> recognition and subsequently handling for 2-byte strlen?  The situation
> might of course be more complicated than memset because of encodings
> etc.  My snippet in question used a fixed-length encoding of 2 bytes,
> however.

I'd say a multibyte memchr might make sense, but strlen specifically?
Not sure.  Likewise multibyte memcmp.

Richard.

> Another simple idea to tackle this would be a peephole optimization,
> but I'm not sure that is really feasible for something like memset.
> Wouldn't the peephole have to be recursive then?
>
> Regards
>  Robin
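A C-level model of the "IFN with loopy fallback" idea above (a sketch
only; setmem16 is a hypothetical helper, not a GCC internal function):
when the 16-bit pattern consists of two equal bytes the operation
degenerates to plain memset, and otherwise a generic element loop
remains, which a target could instead expand to a dedicated instruction.

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  static void setmem16 (uint16_t *p, uint16_t v, size_t n)
  {
    if ((v >> 8) == (v & 0xff))
      memset (p, v & 0xff, n * sizeof *p);  /* equal bytes: reuse memset */
    else
      for (size_t i = 0; i < n; i++)        /* generic "loopy" expansion */
        p[i] = v;
  }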
Re: k-byte memset/memcpy/strlen builtins
On Wed, 2017-01-11 at 17:16 +0100, Robin Dapp wrote:
> Hi,

Hi Robin,

I thought I'd share some of what I've run into while doing similar things
for the rs6000 target.

First off, be aware that glibc does some macro expansion things to try to
handle 1/2/3-byte string operations in some cases.

Secondly, the way I approached this was to use the patterns defined in
optabs.def for these things:

  OPTAB_D (cmpmem_optab, "cmpmem$a")
  OPTAB_D (cmpstr_optab, "cmpstr$a")
  OPTAB_D (cmpstrn_optab, "cmpstrn$a")
  OPTAB_D (movmem_optab, "movmem$a")
  OPTAB_D (setmem_optab, "setmem$a")
  OPTAB_D (strlen_optab, "strlen$a")

If you define movmemsi, that should get used by expand_builtin_memcpy for
any memcpy call that it sees.

The constraints I was able to find when implementing cmpmemsi for memcmp
were:
 * don't compare past the given length (obviously)
 * don't read past the given length
 * except it's ok to do so if you can prove via alignment or runtime
   check that you are not going to cause a pagefault.  Not crossing a 4k
   boundary seems to be generally viewed as acceptable (see the sketch
   below).

I would recommend looking at preprocessed code to make sure no funny
business is happening, and then look at your .md files.  It looks to me
like s390 has got both movmem and strlen patterns there already.

If I understand correctly, you are wanting to do multi-byte characters.
It seems to me you need to follow the path Richard Biener suggests and
make optab expansions that handle wider chars, and then perhaps map
wcslen et al. to them?

Aaron

> For 1-byte memset/memcpy the builtin functions provide a
> straightforward way to achieve this.  At first sight it seemed possible
> to extend tree-loop-distribution.c to include the additional variants
> we need.  However, multibyte memsets/memcpys are not covered by the C
> standard and I'm therefore unsure whether such an approach is
> preferable or whether there are more idiomatic ways or places to add
> the functionality.
>
> The same question goes for 2-byte strlen.  I didn't see a recognition
> pattern for strlen (apart from optimizations due to known string length
> in tree-ssa-strlen.c).  Would it make sense to include strlen
> recognition and subsequently handling for 2-byte strlen?  The situation
> might of course be more complicated than memset because of encodings
> etc.  My snippet in question used a fixed-length encoding of 2 bytes,
> however.
>
> Another simple idea to tackle this would be a peephole optimization,
> but I'm not sure that is really feasible for something like memset.
> Wouldn't the peephole have to be recursive then?
>
> Regards
>  Robin

--
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520  home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
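As a hedged illustration of the page-boundary rule in the constraints
above (plain C, not code from either backend): an 8-byte load at p cannot
fault when it does not cross a 4 KiB page boundary, even if fewer than 8
bytes of the buffer remain.

  #include <stdint.h>

  /* True when loading 8 bytes starting at p stays within one 4 KiB
     page, so the over-read cannot cause a pagefault.  */
  static int load8_is_page_safe (const void *p)
  {
    return ((uintptr_t) p & 0xfff) <= 0x1000 - 8;
  }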
Re: Worse code after bbro?
On Thu, Jan 05, 2017 at 07:39:21PM +0100, Jan Hubicka wrote:
> In fact cfglayout was invented to implement bb-reorder originally :)

So, hrm, are there any passes we *want* to do in non-cfglayout mode?

Segher
Re: Help with integrating my test program into dejagnu
A test [istarget x86_64-*-gnu] is wrong; i?86-* -m64 should always be
handled exactly the same as x86_64-* -m64.

You need to work out which ABIs (-m32, -mx32, -m64) this testing is
meaningful for.  Then, allow both x86_64- and i?86- targets, together
with an appropriate effective-target test ("lp64" to allow just -m64,
for example).

--
Joseph S. Myers
jos...@codesourcery.com
Re: Help with integrating my test program into dejagnu
On 01/11/2017 12:25 PM, Joseph Myers wrote:
> A test [istarget x86_64-*-gnu] is wrong; i?86-* -m64 should always be
> handled exactly the same as x86_64-* -m64.
>
> You need to work out which ABIs (-m32, -mx32, -m64) this testing is
> meaningful for.  Then, allow both x86_64- and i?86- targets, together
> with an appropriate effective-target test ("lp64" to allow just -m64,
> for example).

Thank you for your help!  (testsuite/target-supports.exp is quite a
library!)

So this test is 64-bit only and makes heavy use of gcc extensions (mostly
the ms_abi attribute).  Its aim is to test prologue & epilogue creation
for 64-bit ms_abi functions that call sysv_abi functions (these are the
ones that result in the massive SSE clobbers).  I do not believe msvc
supports sysv_abi functions, so the test can't be built there.  As such,
I'm only intending to run this when gcc is being tested on 64-bit x86
platforms.  The resulting program must be executed on the target as well,
so I was using [isnative].  Would this then be the correct test?

  if { (![istarget x86_64-*] && ![istarget i?86-*])
       || ![is-effective-target lp64]
       || ![isnative] } then {
      unsupported "$subdir"
      return
  }

Thanks,
Daniel
Throwing exceptions from a .so linked with -static-lib* ?
TL;DR: I have an issue where if I have a .so linked with -static-lib*,
making all STL symbols private, and I throw an exception out of that .so
to be caught by the caller, then I get a SIGABRT from a gcc_assert() down
in the guts of the exception unwinding:

  #0  0x7773a428 in raise () from /lib/x86_64-linux-gnu/libc.so.6
  #1  0x7773c02a in abort () from /lib/x86_64-linux-gnu/libc.so.6
  #2  0x0040e938 in _Unwind_SetGR (context=, index=, val=)
      at /usr/src/cc/gcc-6.2.0/libgcc/unwind-dw2.c:271
  271       gcc_assert (index < (int) sizeof(dwarf_reg_size_table));

Should it be possible to do this successfully, or am I doomed to failure?
More details and a test case below.

More detail: I'm trying to distribute a shared library built with the
latest version of C++ (well, GCC 6.2 with C++14) on GNU/Linux.  I compile
it with an older sysroot, taken from RHEL 6.3 (glibc 2.12), so it will
run on older systems.  My .so is written in C++, and programs that link
it will also be written in C++, although they may be compiled and linked
with potentially much older versions of GCC (like 4.9 etc.).  I'm not
worried about programs compiled with clang or whatever at this point.

Because I want to use new C++ but want users to be able to use my .so on
older systems, I link with -static-libgcc -static-libstdc++.  Because I
don't want to worry about security issues etc. in system libraries, I
don't link anything else statically.  I also use a linker script to force
all symbols (even libstdc++ symbols) to be private to my shared library
except the ones I want to publish.  Using "nm | grep ' [A-TV-Z] '" I can
see that no other symbols besides mine are public.

However, if my library throws an exception which I expect to be handled
by the program linking my library, then I get a SIGABRT, as above; the
full backtrace is:

  #0  0x7773a428 in raise () from /lib/x86_64-linux-gnu/libc.so.6
  #1  0x7773c02a in abort () from /lib/x86_64-linux-gnu/libc.so.6
  #2  0x0040e938 in _Unwind_SetGR (context=, index=, val=)
      at /usr/src/cc/gcc-6.2.0/libgcc/unwind-dw2.c:271
  #3  0x004012a2 in __gxx_personality_v0 ()
  #4  0x77feb903 in _Unwind_RaiseException_Phase2 (exc=exc@entry=0x43b890,
      context=context@entry=0x7fffe330)
      at /usr/src/cc/gcc-6.2.0/libgcc/unwind.inc:62
  #5  0x77febf8a in _Unwind_RaiseException (exc=0x43b890)
      at /usr/src/cc/gcc-6.2.0/libgcc/unwind.inc:131
  #6  0x77fde84b in __cxa_throw () from /home/psmith/src/static-eh/libmylib.so
  #7  0x77fddecb in MyLib::create () at mylib.cpp:2
  #8  0x00400da4 in main () at myprog.cpp:2

I should note that if I use the GCC 5.4 that comes standard on my OS,
rather than my locally-built version, I get identical behavior and
backtrace (except not as much debuggability, of course).  So I don't
think it's an incorrect build.

If I don't use -static-libstdc++ with my .so, then it doesn't fail.  Also,
if I don't use a linker script to hide all the C++ symbols, it doesn't
fail (but of course anyone who links with my shared library will use my
copy of the STL).

Here's a repro case (this shows the problem on my Ubuntu GNOME 16.04
GNU/Linux system with GCC 5.4 and binutils 2.26.1):

  ~$ cat mylib.h
  class MyLib {
    public:
      static void create();
  };

  ~$ cat mylib.cpp
  #include "mylib.h"
  void MyLib::create() { throw 42; }

  ~$ cat myprog.cpp
  #include "mylib.h"
  int main() {
    try { MyLib::create(); }
    catch (...) { return 0; }
    return 1;
  }

  ~$ cat ver.map
  { global: _ZN5MyLib6createEv; local: *; };

  ~$ g++ -I. -g -fPIC -static-libgcc -static-libstdc++ \
        -Wl,--version-script=ver.map -Wl,-soname=libmylib.so \
        -shared -o libmylib.so mylib.cpp
  ~$ g++ -I. -g -fPIC -L. -Wl,-rpath="\$ORIGIN" -o myprog myprog.cpp \
        -lmylib
  ~$ ./myprog
  Aborted (core dumped)

Now if I rebuild without the --version-script argument or without
-static-libstdc++, I get success as expected:

  ~$ ./myprog
  ~$ echo $?
  0